CN106372063A - Information processing method and device and terminal

Info

Publication number: CN106372063A
Application number: CN201610940520.XA
Authority: CN
Inventors: 张昊; 谢瑜; 朱频频
Original assignee: Shanghai Zhizhen Intelligent Network Technology Co Ltd
Current assignee: Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority date: 2016-11-01
Filing date: 2016-11-01
Publication date: 2017-02-01

Abstract

An information processing method, an information processing device and a terminal are provided. The information processing method comprises: segmenting a corpus to be processed to obtain a plurality of words; performing synonym replacement on at least some of the words to obtain a new corpus; and performing keyword extraction on the new corpus to obtain one or more keywords. The technical solution improves the accuracy of keyword extraction.

Description

Information processing method, device and terminal

Technical field

The present invention relates to the field of natural language processing, and more particularly to an information processing method, device and terminal.

Background

At present, keywords are mostly extracted either with methods based on statistical features (such as word frequency statistics) or with methods based on text ranking (TextRank). Algorithms based on statistical features are simple to operate. Methods based on text ranking determine the relations between words from their co-occurrence.

However, algorithms based on statistical features can overlook words that occur infrequently, or occupy an unimportant position in a document, yet are of key significance to the document. Methods based on text ranking lack semantic understanding, so words that share a topic but do not fall in the same window cannot be associated.

How to improve the accuracy of keyword extraction is therefore a problem that urgently needs to be solved.

Summary of the invention

The technical problem solved by the present invention is how to improve the accuracy of keyword extraction.

To solve the above technical problem, an embodiment of the present invention provides an information processing method, comprising:

performing word segmentation on a corpus to be processed to obtain a plurality of words; performing synonym replacement on at least a part of the plurality of words to obtain a new corpus; and performing keyword extraction on the new corpus to obtain one or more keywords.

Optionally, performing synonym replacement on at least a part of the plurality of words comprises: determining, according to a preset thesaurus, at least one synonym group among the at least a part of the plurality of words, wherein words that are synonyms of one another are placed in the same synonym group; and, for each synonym group, selecting the word with the highest word frequency in the corpus to be processed and replacing the other words with that highest-frequency word, to obtain the new corpus.

Optionally, after performing synonym replacement on at least a part of the plurality of words, the method further comprises: screening the new corpus for use in keyword extraction.

Optionally, before performing synonym replacement on at least a part of the plurality of words, the method further comprises: screening the plurality of words to obtain a plurality of candidate words.

Optionally, performing synonym replacement on at least a part of the plurality of words comprises: determining, according to a preset thesaurus, at least one synonym group among the plurality of candidate words, wherein candidate words that are synonyms of one another are placed in the same synonym group; and, for each synonym group, selecting the candidate word with the highest word frequency in the corpus to be processed and replacing the other candidate words with that highest-frequency candidate word, to obtain the new corpus.

Optionally, the screening is performed in one or more of the following ways: screening by part of speech, retaining nouns, adjectives and verbs; and screening by frequency, retaining words whose frequency is greater than a frequency threshold.

Optionally, performing keyword extraction on the new corpus comprises: computing statistics on the new corpus to obtain the word frequency and position information of the new corpus in the corpus to be processed; and inputting the new corpus, together with its word frequency and position information, into a TextRank algorithm to perform keyword extraction on the corpus to be processed.

Optionally, the information processing method further comprises: verifying the accuracy of the one or more extracted keywords to obtain a verification result; adjusting the parameters of the TextRank algorithm according to the verification result; and extracting the keywords again with the parameter-adjusted TextRank algorithm, until the verification result for the keywords meets a preset requirement.

Optionally, before performing word segmentation on the corpus to be processed, the method further comprises: preprocessing the corpus to be processed to obtain a corpus to be processed in a uniform format.

Optionally, preprocessing the corpus to be processed comprises: converting the corpus to be processed into text format to obtain text data; filtering preset words out of the text data, wherein the preset words are one or more of profanity, sensitive words and stop words; and dividing the filtered text data according to punctuation.

Optionally, the corpus to be processed is segmented in one or more of the following ways: a dictionary-based bidirectional maximum matching algorithm, the Viterbi algorithm, an HMM algorithm and a CRF algorithm.

To solve the above technical problem, an embodiment of the present invention further discloses an information processing device, comprising:

a word segmentation unit, adapted to perform word segmentation on a corpus to be processed to obtain a plurality of words; a synonym replacement unit, adapted to perform synonym replacement on at least a part of the plurality of words to obtain a new corpus; and a keyword extraction unit, adapted to perform keyword extraction on the new corpus to obtain one or more keywords.

Optionally, the synonym replacement unit comprises: a first synonym group determination subunit, adapted to determine, according to a preset thesaurus, at least one synonym group among the at least a part of the plurality of words, wherein words that are synonyms of one another are placed in the same synonym group; and a first replacement subunit, adapted to select, for each synonym group, the word with the highest word frequency in the corpus to be processed and to replace the other words with that highest-frequency word, to obtain the new corpus.

Optionally, the information processing device further comprises: a first screening unit, adapted to screen the new corpus for use in keyword extraction.

Optionally, the information processing device further comprises: a second screening unit, adapted to screen the plurality of words to obtain a plurality of candidate words.

Optionally, the synonym replacement unit comprises: a second synonym group determination subunit, adapted to determine, according to a preset thesaurus, at least one synonym group among the plurality of candidate words, wherein candidate words that are synonyms of one another are placed in the same synonym group; and a second replacement subunit, adapted to select, for each synonym group, the candidate word with the highest word frequency in the corpus to be processed and to replace the other candidate words with that highest-frequency candidate word, to obtain the new corpus.

Optionally, the keyword extraction unit comprises: a statistics subunit, adapted to compute statistics on the new corpus to obtain the word frequency and position information of the new corpus in the corpus to be processed; and an extraction subunit, adapted to input the new corpus, together with its word frequency and position information, into a TextRank algorithm to perform keyword extraction on the corpus to be processed.

Optionally, the information processing device further comprises: a verification unit, adapted to verify the accuracy of the one or more extracted keywords to obtain a verification result; an adjustment unit, adapted to adjust the parameters of the TextRank algorithm according to the verification result; and an extraction unit, adapted to extract the keywords again with the parameter-adjusted TextRank algorithm, until the verification result for the keywords meets a preset requirement.

Optionally, the information processing device further comprises: a preprocessing unit, adapted to preprocess the corpus to be processed to obtain a corpus to be processed in a uniform format.

Optionally, the preprocessing unit comprises: a format conversion subunit, adapted to convert the corpus to be processed into text format to obtain text data; a filtering subunit, adapted to filter preset words out of the text data, wherein the preset words are one or more of profanity, sensitive words and stop words; and a division subunit, adapted to divide the filtered text data according to punctuation.

Optionally, the word segmentation unit segments the corpus to be processed in one or more of the following ways: a dictionary-based bidirectional maximum matching algorithm, the Viterbi algorithm, an HMM algorithm and a CRF algorithm.

To solve the above technical problem, an embodiment of the present invention further discloses a terminal, the terminal comprising the information processing device.

Compared with the prior art, the technical solution of the embodiments of the present invention has the following advantages:

The technical solution of the present invention performs word segmentation on a corpus to be processed to obtain a plurality of words; performs synonym replacement on at least a part of the plurality of words to obtain a new corpus; and performs keyword extraction on the new corpus to obtain one or more keywords. Before keyword extraction, the technical solution processes the information in advance, that is, it performs synonym replacement on at least a part of the segmented words. As a result, keyword extraction can take the semantic features of synonymous vocabulary into account and count the contribution of synonyms when determining keywords, which avoids overlooking words that occur infrequently but are of key significance to the document, and thereby improves the accuracy of keyword extraction.

Further, statistics are computed on the new corpus to obtain the word frequency and position information of the new corpus in the corpus to be processed, and the new corpus, together with its word frequency and position information, is input into the TextRank algorithm to perform keyword extraction on the corpus to be processed. Because the technical solution extracts keywords with the TextRank algorithm, it can avoid overlooking words that occupy an unimportant position in a document but are of key significance to the document. At the same time, because the semantic features of synonymous vocabulary are included, words that share a topic but do not fall in the same window can be associated, so the keywords of the corpus to be processed are extracted automatically and with high accuracy.

Brief description of the drawings

Fig. 1 is a flowchart of an information processing method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another information processing method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an information processing device according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of another information processing device according to an embodiment of the present invention.

Detailed description

As described in the background, prior-art algorithms based on statistical features can overlook words that occur infrequently, or occupy an unimportant position in a document, yet are of key significance to the document. Methods based on text ranking lack semantic understanding, so words that share a topic but do not fall in the same window cannot be associated.

To make the above objects, features and advantages of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of an information processing method according to an embodiment of the present invention.

The information processing method shown in Fig. 1 may comprise the following steps:

Step S101: perform word segmentation on a corpus to be processed to obtain a plurality of words;

Step S102: perform synonym replacement on at least a part of the plurality of words to obtain a new corpus;

Step S103: perform keyword extraction on the new corpus to obtain one or more keywords.

In a specific implementation, the corpus to be processed may be one or more texts. First, in step S101, the corpus to be processed is segmented to obtain a plurality of words. Then, in step S102, synonym replacement is performed on at least a part of the plurality of words to obtain a new corpus. Finally, in step S103, keyword extraction is performed on the new corpus to obtain one or more keywords.

In the embodiments of the present invention, synonym replacement is performed before keyword extraction. In other words, the information is processed in advance, before keyword extraction, by performing synonym replacement on at least a part of the segmented words. Keyword extraction can then take the semantic features of synonymous vocabulary into account, so that the contribution of synonyms is counted when the probability of a word being a keyword is determined. This avoids overlooking words that occur infrequently but are of key significance to the document, and thereby improves the accuracy of keyword extraction.

In a specific implementation, in step S101 the corpus to be processed may be segmented in one or more of the following ways: a dictionary-based bidirectional maximum matching algorithm, the Viterbi algorithm, a hidden Markov model (HMM) algorithm and a conditional random field (CRF) algorithm.

Those skilled in the art will understand that any practicable algorithm may be used for word segmentation; the embodiments of the present invention place no limitation on this.
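As an illustration of the first option listed above, dictionary-based bidirectional maximum matching can be sketched as follows. This is a minimal sketch, not the segmentation engine of the embodiment: the function names are hypothetical, the vocabulary is supplied by the caller, and the tie-breaking rule (fewer words, then fewer single-character words, then the backward result) is a common heuristic assumed here rather than something the text specifies.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy longest-match segmentation scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Greedy longest-match segmentation scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, vocab):
    """Run both scans; prefer fewer words, then fewer single-character words."""
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(ws):
        return sum(len(w) == 1 for w in ws)

    # Assumed tie-break: fall back to the backward result when still tied.
    return fwd if singles(fwd) < singles(bwd) else bwd
```

The two scans can disagree on ambiguous input, which is why both are run and compared before one result is kept.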

In a specific implementation, step S102 may comprise the following steps: determining, according to a preset thesaurus, at least one synonym group among the at least a part of the plurality of words, wherein words that are synonyms of one another are placed in the same synonym group; and, for each synonym group, selecting the word with the highest word frequency in the corpus to be processed and replacing the other words with that highest-frequency word, to obtain the new corpus. Specifically, the preset thesaurus may be built from the extended edition of the Harbin Institute of Technology "Tongyici Cilin" synonym thesaurus together with domain-specific thesauri, or from other publicly available synonym dictionaries or user-defined synonym dictionaries. At least a part of the plurality of words is traversed against the preset thesaurus to determine the synonym groups, and synonym replacement is then performed within each synonym group, that is, all words in each synonym group are unified into the word with the highest word frequency. For example, if two different words that both mean "tomato" belong to the same synonym group and one of them has the higher word frequency in the corpus to be processed, every occurrence of the other word in that synonym group is replaced with the higher-frequency word.
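The replacement rule described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the thesaurus is represented as a plain list of synonym lists, the function name `replace_synonyms` is hypothetical, and ties in word frequency are resolved by list order, a detail the text leaves open.

```python
from collections import Counter


def replace_synonyms(words, synonym_groups):
    """Map every word of a synonym group to the group's most frequent member."""
    freq = Counter(words)
    canonical = {}
    for group in synonym_groups:
        # Only members that actually occur in the corpus take part.
        present = [w for w in group if freq[w] > 0]
        if len(present) < 2:
            continue  # nothing to unify for this group
        # max() keeps the first maximal element, so ties go to list order.
        target = max(present, key=lambda w: freq[w])
        for w in present:
            canonical[w] = target
    return [canonical.get(w, w) for w in words]
```

After replacement, the occurrences of all synonyms are pooled under one surface form, so a later frequency count or co-occurrence graph sees them as a single word.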

Those skilled in the art will appreciate that, when performing the replacement within a synonym group, the embodiments of the present invention replace the other words with the highest-frequency word, which reduces the workload of the replacement operation. In a specific implementation, the words in a synonym group may instead be replaced with any one of the synonymous words in that group, provided that all words in the group end up identical; the embodiments of the present invention place no limitation on this.

It should be noted that the preset thesaurus may also be any other practicable thesaurus; the embodiments of the present invention place no limitation on this.

In a specific implementation, the following step may be included after step S102: screening the new corpus for use in keyword extraction. Specifically, the screening may be performed in one or more of the following ways: screening by part of speech, retaining nouns, adjectives and verbs; and screening by frequency, retaining words whose frequency is greater than a frequency threshold. Nouns, adjectives, verbs and words whose frequency exceeds the frequency threshold are relatively likely to be keywords, whereas other words are very unlikely to be keywords. Keyword extraction therefore only needs to consider the retained words, and the remaining words can be filtered out during information processing to improve execution efficiency. Screening reduces the amount of computation in the keyword extraction step and thus speeds up keyword extraction.
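The two screening modes can be combined in a small filter, sketched below. It assumes the words have already been tagged by some part-of-speech tagger; the tag names `"n"`, `"a"` and `"v"`, the function name `screen` and its parameters are hypothetical choices for illustration.

```python
from collections import Counter

KEPT_TAGS = {"n", "a", "v"}  # hypothetical tag names for noun, adjective, verb


def screen(tagged_words, freq_threshold=None, kept_tags=KEPT_TAGS):
    """Keep (word, tag) pairs that pass the part-of-speech and/or frequency filter."""
    counts = Counter(word for word, _ in tagged_words)
    kept = []
    for word, tag in tagged_words:
        if kept_tags is not None and tag not in kept_tags:
            continue  # part-of-speech screening
        if freq_threshold is not None and counts[word] <= freq_threshold:
            continue  # keep only words whose frequency is greater than the threshold
        kept.append((word, tag))
    return kept
```

Passing `kept_tags=None` disables the part-of-speech filter and leaving `freq_threshold=None` disables the frequency filter, so either mode can be used on its own or both together.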

In another specific embodiment of the present invention, the following step may be included before step S102: screening the plurality of words to obtain a plurality of candidate words. Specifically, the screening may be performed in one or more of the following ways: screening by part of speech, retaining nouns, adjectives and verbs; and screening by frequency, retaining words whose frequency is greater than a frequency threshold. As above, the retained words are relatively likely to be keywords and the remaining words are very unlikely to be keywords, so the remaining words can be filtered out to improve execution efficiency. By screening before step S102, this embodiment reduces the amount of computation in both step S102 and step S103 and further speeds up keyword extraction.

In a specific implementation, after the screening, step S102 may comprise the following steps: determining, according to the preset thesaurus, at least one synonym group among the plurality of candidate words, wherein candidate words that are synonyms of one another are placed in the same synonym group; and, for each synonym group, selecting the candidate word with the highest word frequency in the corpus to be processed and replacing the other candidate words with that highest-frequency candidate word, to obtain the new corpus. Because screening has been performed in advance, the amount of data to be processed during synonym replacement in step S102 is greatly reduced and execution efficiency is improved.

In a specific implementation, step S103 may comprise the following steps: computing statistics on the new corpus to obtain the word frequency and position information of the new corpus in the corpus to be processed; and inputting the new corpus, together with its word frequency and position information, into the TextRank algorithm to perform keyword extraction on the corpus to be processed. Specifically, the TextRank algorithm may be a position-weighted TextRank algorithm. The classical TextRank algorithm extracts keywords automatically from the text itself by means of the relations between adjacent words; because it requires no training process, it is convenient to apply. The position-weighted TextRank algorithm builds on classical TextRank by introducing weights for coverage influence, position influence and frequency influence when computing the relations between adjacent words. Through the synonym replacement of step S102, the embodiments of the present invention add semantic features on top of the TextRank algorithm when extracting keywords, which avoids overlooking words that occupy an unimportant position in a document but are of key significance to the document. At the same time, because the semantic features of synonymous vocabulary are included, words that share a topic but do not fall in the same window can be associated, so the keywords of the corpus to be processed are extracted automatically and with high accuracy.
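For orientation, the classical TextRank update on a word co-occurrence graph can be sketched as below. This is a simplified sketch, not the position-weighted variant of the embodiment: edge weights here are plain co-occurrence counts, whereas the text computes them from coverage, position and frequency influence without giving formulas. The function name and the default values for `window`, `damping` and `iterations` are assumptions.

```python
from collections import defaultdict


def textrank_keywords(words, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank words by iterating the weighted TextRank update on a co-occurrence graph."""
    # Undirected edge weight = number of times two distinct words share a window.
    weight = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:
            if u != w:
                weight[w][u] += 1.0
                weight[u][w] += 1.0
    vocab = list(dict.fromkeys(words))  # unique words in first-seen order
    out_sum = {w: sum(weight[w].values()) for w in vocab}
    score = {w: 1.0 for w in vocab}
    for _ in range(iterations):
        # Each word receives a share of every neighbour's score,
        # proportional to the weight of the connecting edge.
        score = {
            w: (1 - damping) + damping * sum(
                weight[u][w] / out_sum[u] * score[u] for u in weight[w]
            )
            for w in vocab
        }
    return sorted(vocab, key=score.get, reverse=True)[:top_k]
```

Because synonyms have already been unified before this step, edges that would otherwise be split across several surface forms accumulate on a single node, which is how the replacement step feeds semantic information into the ranking.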

In a specific implementation, the following step may be included before step S101: preprocessing the corpus to be processed to obtain a corpus to be processed in a uniform format. Specifically, the corpus to be processed is converted into text format to obtain text data; preset words are filtered out of the text data, wherein the preset words are one or more of profanity, sensitive words and stop words; and the filtered text data is divided according to punctuation. More specifically, the filtered text data may be split into lines at sentence-ending punctuation, for example "？", "！" and "。", and saved. The preprocessing of this embodiment makes the subsequent steps more convenient to perform.
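The filtering and splitting part of this preprocessing can be sketched as follows, assuming the input has already been converted to a plain string. The function name `preprocess`, the `enders` parameter and the choice to remove preset words by plain substring replacement are illustrative assumptions; a real system would likely filter after segmentation to avoid removing substrings of longer words.

```python
import re


def preprocess(text, preset_words, enders="？！。"):
    """Remove preset words, then split the text into one sentence per entry."""
    # Remove longer preset words first so that shorter ones cannot split them.
    for word in sorted(preset_words, key=len, reverse=True):
        text = text.replace(word, "")
    cls = re.escape(enders)
    # Each piece is a run of non-terminator characters plus an optional terminator.
    pieces = re.findall(f"[^{cls}]+[{cls}]?", text)
    return [p.strip() for p in pieces if p.strip()]
```

Each returned entry corresponds to one saved line; for ASCII text the sentence terminators can be overridden, for example `enders=".?!"`.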

In a specific implementation, the information processing method shown in Fig. 1 may further comprise the following steps: verifying the accuracy of the one or more extracted keywords to obtain a verification result; adjusting the parameters of the TextRank algorithm according to the verification result; and extracting the keywords again with the parameter-adjusted TextRank algorithm, until the verification result for the keywords meets a preset requirement. By verifying the keyword extraction result and then adjusting the parameters of the TextRank algorithm, the embodiments of the present invention further improve the accuracy of keyword extraction with the parameter-adjusted TextRank algorithm, so that the adjusted algorithm can be used in real application scenarios.
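One simple way to realize this verify-adjust-repeat loop is to walk through a list of candidate parameter settings until the measured accuracy meets the requirement. The sketch below makes several assumptions the text does not state: accuracy is measured as the fraction of extracted keywords that appear in a hand-labelled reference set, parameters are adjusted by trying a fixed list of settings, and `extract` is any keyword extraction function supplied by the caller.

```python
def tune(extract, labelled_docs, param_grid, required_accuracy):
    """Try parameter settings in turn until keyword accuracy meets the requirement."""
    best_params, best_acc = None, -1.0
    for params in param_grid:
        hits = total = 0
        for words, reference in labelled_docs:
            predicted = extract(words, **params)
            hits += len(set(predicted) & set(reference))
            total += len(predicted)
        acc = hits / total if total else 0.0
        if acc > best_acc:
            best_params, best_acc = params, acc
        if acc >= required_accuracy:
            break  # the verification result meets the preset requirement
    return best_params, best_acc
```

If no setting reaches the required accuracy, the function still returns the best setting seen, leaving it to the caller to decide whether to extend the search.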

It should be noted that the preset requirement may be an accuracy rate. Its specific value may be configured and adapted according to the actual application environment; the embodiments of the present invention place no limitation on this.

In a preferred embodiment, the information processing method is as shown in Fig. 2, which is a flowchart of another information processing method according to an embodiment of the present invention.

The information processing method shown in Fig. 2 may comprise the following steps:

Step S201: preprocess the corpus to be processed;

Step S202: perform word segmentation on the corpus to be processed to obtain a plurality of words;

Step S203: determine whether the part of speech of a word is noun, verb or adjective; if so, go to step S204; otherwise, perform no operation;

Step S204: add the word to a candidate dictionary;

Step S205: build a preset thesaurus;

Step S206: determine whether word wi and word wj are synonyms; if so, go to step S207; otherwise, perform no operation;

Step S207: determine whether the word frequency of word wi is greater than the word frequency of word wj; if so, go to step S208; otherwise, go to step S209;

Step S208: replace word wj in the corpus to be processed with word wi;

Step S209: replace word wi in the corpus to be processed with word wj;

Step S210: perform keyword extraction on the replaced corpus with the position-weighted TextRank algorithm.

In a specific implementation, in step S201 the corpus to be processed is unified into text format, invalid formatting is filtered out, and preset words such as profanity, sensitive words and stop words are removed. The processed corpus is then split into lines at sentence-ending punctuation, for example "？", "！" and "。", and saved.

In a specific implementation, in step S202 a word segmentation engine may be used to segment the corpus to be processed in text format. Specifically, the word segmentation engine may use a dictionary-based bidirectional maximum matching algorithm, the Viterbi algorithm, a hidden Markov model (HMM) algorithm or a conditional random field (CRF) algorithm.

Those skilled in the art will understand that the word segmentation engine may use any practicable algorithm; the embodiments of the present invention place no limitation on this.

In a specific implementation, words whose part of speech is noun, adjective or verb are relatively likely to be keywords, whereas words of other parts of speech are very unlikely to be keywords, so keyword extraction only needs to consider the former. Accordingly, in step S203 the segmented words may be screened by part of speech, and in step S204 the words whose part of speech is noun, adjective or verb are added to the candidate dictionary. Words whose part of speech is not noun, adjective or verb are not considered, that is, no operation is performed on them.

It can be understood that this embodiment screens by part of speech. In practical applications, step S203 may instead screen by frequency, in which case step S204 adds the words whose frequency is greater than the frequency threshold to the candidate dictionary.

In a specific implementation, in step S205 the preset thesaurus is built and is used to determine which words in the candidate dictionary are synonyms. Specifically, the preset thesaurus may be built from the "Tongyici Cilin" synonym thesaurus and domain-specific thesauri.

In another specific embodiment of the present invention, step S205 may also be performed before step S201. That is, the preset thesaurus is built in advance, before the information processing method is run, to reduce the workload of the subsequent steps.

In a specific implementation, step S206 determines whether word wi and word wj are synonyms. If they are, step S207 determines whether the word frequency of word wi is greater than that of word wj. If the word frequency of wi is greater, step S208 replaces word wj in the corpus to be processed with word wi; otherwise, step S209 replaces word wi in the corpus to be processed with word wj. In other words, two words wi and wj that are synonyms of each other are unified into whichever has the higher word frequency. If word wi and word wj are not synonyms, they are not considered, that is, no operation is performed.

Furthermore, steps S206 to S209 form a loop that operates on the candidate dictionary. The loop (steps S206 to S209) is repeated over the candidate dictionary until all of its words have been traversed. At that point every pair of synonymous words in the candidate dictionary has been unified, and the method can proceed to the next step (step S210).

Because nouns, adjectives and verbs are relatively likely to be keywords of a text, while other parts of speech are very unlikely to be keywords, only words of these parts of speech are considered, which improves the execution efficiency of the program.

In a specific implementation, in step S210 keyword extraction is performed on the corpus to be processed, after replacement is complete, with the position-weighted TextRank algorithm, to obtain one or more keywords.

Before keyword extraction, the embodiments of the present invention process the information in advance, that is, they perform synonym replacement on at least a part of the segmented words. Keyword extraction can then take the semantic features of synonymous vocabulary into account, so that the contribution of synonyms is counted when the probability of a word being a keyword is determined. This avoids overlooking words that occur infrequently but are of key significance to the document, and thereby improves the accuracy of keyword extraction. In addition, because keywords are extracted with the TextRank algorithm, words that occupy an unimportant position in a document but are of key significance to the document are not overlooked. Finally, because the semantic features of synonymous vocabulary are included, words that share a topic but do not fall in the same window can be associated, so the keywords of the corpus to be processed are extracted automatically and with high accuracy.

In a specific implementation, steps S201 to S210 shown in Fig. 2 can be regarded as the process of building a keyword extraction model. That is, performing steps S201 to S210 establishes a keyword extraction model that can perform keyword extraction on a corpus. To further improve the accuracy of keyword extraction, the model may be optimized further after step S210. Specifically, the accuracy of the keyword extraction result is verified to obtain a verification result; the parameters of the TextRank algorithm are adjusted according to the verification result; and the keywords are extracted again with the parameter-adjusted TextRank algorithm, until the verification result for the keywords meets the preset requirement. At that point the keyword extraction model is stable and can be used directly to extract document keywords in real application scenarios.

For specific implementations of this embodiment, reference may be made to the information processing method shown in Fig. 1; the details are not repeated here.

Fig. 3 is a schematic structural diagram of an information processing device according to an embodiment of the present invention.

The information processing device 30 shown in Fig. 3 may include: a word segmentation unit 301, a synonym replacement unit 302 and a keyword extraction unit 303.

The word segmentation unit 301 is adapted to perform word segmentation on a corpus to be processed to obtain a plurality of words;

the synonym replacement unit 302 is adapted to perform synonym replacement on at least a part of the plurality of words to obtain a new corpus;

Keyword extracting unit 303 is suitable to carry out keyword extraction process to described new language material, to obtain one or many Individual key word.

In a specific implementation, the word segmentation unit 301 may segment the corpus to be processed using one or more of the following: dictionary-based bidirectional maximum matching, the Viterbi algorithm, the HMM algorithm, and the CRF algorithm.
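Of the segmentation methods listed above, dictionary-based bidirectional maximum matching is the simplest to illustrate. The following is a minimal sketch, not the implementation of this embodiment: the function names, the toy dictionary, and the tie-breaking rule (prefer the segmentation with fewer words, then fewer single-character words) are assumptions; that rule is a common heuristic and is not specified in the description.

```python
def forward_mm(text, vocab, max_len):
    """Greedy left-to-right longest match; unknown characters become single words."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                out.append(piece)
                i += size
                break
    return out


def backward_mm(text, vocab, max_len):
    """Greedy right-to-left longest match; result is returned in reading order."""
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                out.append(piece)
                j -= size
                break
    return out[::-1]


def bidirectional_mm(text, vocab):
    """Run both directions and keep the segmentation with the lower cost."""
    max_len = max(map(len, vocab), default=1)
    fwd = forward_mm(text, vocab, max_len)
    bwd = backward_mm(text, vocab, max_len)

    def cost(seg):
        # Assumed heuristic: fewer words first, then fewer single-character words.
        return (len(seg), sum(len(w) == 1 for w in seg))

    return fwd if cost(fwd) < cost(bwd) else bwd
```

With the toy dictionary {"研究", "研究生", "生命", "的", "起源"}, the forward pass greedily takes "研究生" and leaves a stray single character, while the backward pass finds "研究 / 生命 / 的 / 起源", which the cost rule prefers.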

In a specific implementation, the synonym replacement unit 302 may include a first synonym group determination subunit (not shown) and a first replacement subunit (not shown).

The first synonym group determination subunit is adapted to determine, according to a preset thesaurus, at least one synonym group within at least a portion of the plurality of words, where words that are synonymous with one another are placed in the same synonym group. The first replacement subunit is adapted to, for each synonym group, select the word with the highest word frequency in the corpus to be processed and replace the other words in the group with that highest-frequency word, to obtain the new corpus. Specifically, the preset thesaurus may be built from the extended edition of the Harbin Institute of Technology "Tongyici Cilin" synonym thesaurus together with thesauri for individual domains. At least a portion of the plurality of words is traversed according to the preset thesaurus to determine the synonym groups, and synonym replacement is then performed within each synonym group, so that all words in a group are unified into the word with the highest word frequency. For example, if two different words that both mean "tomato" belong to the same synonym group and one of them has the higher word frequency in the corpus to be processed, every occurrence of the other word is replaced with the higher-frequency word.
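The synonym unification step described above can be sketched as follows. This is a minimal illustration under assumptions: the thesaurus is represented as a list of synonym sets, the function name `unify_synonyms` is hypothetical, ties in word frequency are broken alphabetically (the description does not say how ties are handled), and the two example words for "tomato" are chosen only for illustration.

```python
from collections import Counter


def unify_synonyms(words, thesaurus):
    """Replace each word with the most frequent member of its synonym group.

    `words` is the segmented corpus; `thesaurus` is an iterable of synonym
    groups, each a set of words (assumed representation).
    """
    freq = Counter(words)
    mapping = {}
    for group in thesaurus:
        present = [w for w in group if w in freq]
        if len(present) < 2:
            continue  # nothing to unify for this group
        # Sorting first makes the choice deterministic when frequencies tie.
        representative = max(sorted(present), key=lambda w: freq[w])
        for w in present:
            mapping[w] = representative
    return [mapping.get(w, w) for w in words]


corpus = ["番茄", "炒", "蛋", "西红柿", "番茄", "好吃"]
new_corpus = unify_synonyms(corpus, [{"番茄", "西红柿"}])
```

After replacement the unified word occurs three times instead of two, which is how the synonym's contribution is counted in the later frequency statistics.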

In a specific implementation, the information processing device 30 may further include a first screening unit (not shown), which is adapted to perform screening processing on the new corpus for use in keyword extraction processing. Specifically, the first screening unit may perform the screening in one or more of the following ways: screening by part of speech, retaining nouns, adjectives, and verbs; and screening by frequency, retaining words whose frequency is greater than a frequency threshold. That is, nouns, adjectives, verbs, and words whose frequency exceeds the frequency threshold are relatively likely to be keywords, whereas other words are very unlikely to be keywords. Keyword extraction processing therefore needs to consider only the retained words, and the remaining words can be filtered out during information processing. This screening reduces the amount of computation in the keyword extraction step and thereby increases the speed of keyword extraction.
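The two screening modes described above can be sketched as follows. This is a minimal sketch under assumptions: the input is a list of (word, part-of-speech) pairs produced by some tagger, the tag names "n", "a" and "v" are hypothetical, and when both modes are enabled they are applied together (a word must pass both), which the description leaves open.

```python
from collections import Counter

# Assumed tag names for noun, adjective, and verb; real taggers differ.
KEPT_POS = {"n", "a", "v"}


def screen(tagged, by_pos=True, min_freq=None):
    """Keep (word, pos) pairs that pass the enabled screening modes.

    by_pos:   keep only nouns, adjectives, and verbs.
    min_freq: if given, keep only words whose frequency is greater than it.
    """
    freq = Counter(word for word, _ in tagged)
    kept = []
    for word, pos in tagged:
        if by_pos and pos not in KEPT_POS:
            continue
        if min_freq is not None and freq[word] <= min_freq:
            continue
        kept.append((word, pos))
    return kept


tagged = [("keyword", "n"), ("the", "x"), ("extract", "v"),
          ("keyword", "n"), ("quickly", "d")]
by_part_of_speech = screen(tagged)
by_frequency = screen(tagged, by_pos=False, min_freq=1)
```

Either mode can be used alone by toggling `by_pos` or leaving `min_freq` as `None`.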

Specifically, the first screening unit may output the screened new corpus to the keyword extraction unit 303. Because screening has been performed in advance, the amount of data that the keyword extraction unit 303 must process during keyword extraction is greatly reduced, and execution efficiency is improved.

In a specific implementation, the keyword extraction unit 303 may include a statistics subunit (not shown) and an extraction subunit (not shown). The statistics subunit is adapted to perform statistics on the new corpus, to obtain the word frequency and position information of the new corpus within the corpus to be processed. The extraction subunit is adapted to input the new corpus, together with its word frequency and position information, into the TextRank algorithm, to perform keyword extraction on the corpus to be processed.
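For orientation, a plain TextRank keyword ranking over a word sequence can be sketched as follows. This is a generic textbook-style sketch, not the extraction subunit of this embodiment: the description does not specify how word frequency and position information enter the algorithm, so here position is used only to decide which words co-occur within a window and frequency enters only through repeated co-occurrence. The function name and the parameters `window`, `damping`, `iterations` and `top_k` are assumptions.

```python
from collections import defaultdict


def textrank(words, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by TextRank score on an undirected co-occurrence graph."""
    # Link words whose positions are at most `window` apart; the edge weight
    # counts how often the pair co-occurs.
    weight = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                weight[w][words[j]] += 1.0
                weight[words[j]][w] += 1.0

    nodes = sorted(set(words))
    score = {w: 1.0 for w in nodes}
    for _ in range(iterations):
        new_score = {}
        for w in nodes:
            # Each neighbour v passes on its score in proportion to the
            # weight of edge (v, w) among all of v's edges.
            rank = sum(weight[v][w] / sum(weight[v].values()) * score[v]
                       for v in weight[w])
            new_score[w] = (1 - damping) + damping * rank
        score = new_score
    # Highest score first; alphabetical order breaks ties deterministically.
    return sorted(nodes, key=lambda w: (-score[w], w))[:top_k]
```

The window size and damping factor are the kind of parameters that a later verification and adjustment step could tune.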

The information processing device 30 of this embodiment of the present invention performs information processing before keyword extraction processing: synonym replacement is carried out on at least a portion of the words obtained by word segmentation, so that the keyword extraction processing can take the semantic features of synonymous words into account. The contribution of synonyms is thus counted when the probability of a word being a keyword is determined, which avoids ignoring words that occur infrequently but are critical to the document, and thereby improves the accuracy of keyword extraction.

In a specific implementation, the information processing device 30 may further include a preprocessing unit (not shown). The preprocessing unit is adapted to preprocess the corpus to be processed, to obtain a corpus to be processed in a uniform format. Specifically, the preprocessing unit may output the preprocessed corpus to the word segmentation unit 301. More specifically, the preprocessing unit may include a format conversion subunit (not shown), a filtering subunit (not shown), and a division subunit (not shown).

The format conversion subunit is adapted to convert the corpus to be processed into text format, to obtain text data. The filtering subunit is adapted to filter preset words out of the text data, where the preset words are one or more of the following: profanity, sensitive words, and stop words. The division subunit is adapted to divide the filtered text data according to punctuation.
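The filtering and division steps described above can be sketched as follows. This is a minimal sketch under assumptions: the input is taken to be already converted to a plain text string, preset words are removed by simple substring replacement (a crude stand-in, since the description does not say how filtering is done), and the set of punctuation marks used for division is an illustrative choice. The function name `preprocess` is hypothetical.

```python
import re


def preprocess(text, preset_words):
    """Remove preset words from `text`, then split it into clauses at punctuation."""
    # Remove longer preset words first so that a shorter word contained in a
    # longer one does not leave fragments behind.
    for word in sorted(preset_words, key=len, reverse=True):
        text = text.replace(word, "")
    # Split on runs of common Chinese and Western punctuation (assumed set).
    parts = re.split(r"[。！？；，,.!?;]+", text)
    return [part.strip() for part in parts if part.strip()]


clauses = preprocess("今天天气很好，我们去公园。", {"很"})
```

The resulting clauses can then be passed one by one to the word segmentation step.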

For specific implementations of this embodiment of the present invention, reference may be made to the information processing method shown in Fig. 1 or Fig. 2; details are not repeated here.

The information processing device 40 shown in Fig. 4 may include a preprocessing unit 401, a word segmentation unit 402, a second screening unit 403, a synonym replacement unit 404, a keyword extraction unit 405, a verification unit 406, an adjustment unit 407, and an extraction unit 408. The preprocessing unit 401 may include a format conversion subunit 4011, a filtering subunit 4012, and a division subunit 4013; the synonym replacement unit 404 may include a second synonym group determination subunit 4041 and a second replacement subunit 4042; and the keyword extraction unit 405 may include a statistics subunit 4051 and an extraction subunit 4052.

In a specific implementation, the format conversion subunit 4011 is adapted to convert the corpus to be processed into text format, to obtain text data; the filtering subunit 4012 is adapted to filter preset words out of the text data, where the preset words are one or more of the following: profanity, sensitive words, and stop words; and the division subunit 4013 is adapted to divide the filtered text data according to punctuation.

In a specific implementation, the word segmentation unit 402 performs word segmentation on the preprocessing result of the preprocessing unit 401, to obtain a plurality of words. Specifically, the word segmentation unit 402 may segment the corpus to be processed using one or more of the following: dictionary-based bidirectional maximum matching, the Viterbi algorithm, the HMM algorithm, and the CRF algorithm.

In a specific implementation, the second screening unit 403 performs screening processing on the plurality of words in the word segmentation result of the word segmentation unit 402, to obtain a plurality of candidate words. That is, nouns, adjectives, verbs, and words whose frequency exceeds the frequency threshold are relatively likely to be keywords, whereas other words are very unlikely to be keywords. Keyword extraction processing therefore needs to consider only the retained words, and the remaining words can be filtered out during information processing, which improves execution efficiency. In this embodiment of the present invention, performing the screening before the synonym replacement unit 404 reduces the amount of computation in both the synonym replacement unit 404 and the keyword extraction unit 405, and further increases the speed of keyword extraction.

In a specific implementation, the second synonym group determination subunit 4041 is adapted to determine, according to the preset thesaurus, at least one synonym group among the plurality of candidate words, where candidate words that are synonymous with one another are placed in the same synonym group. The second replacement subunit 4042 is adapted to, for each synonym group, select the candidate word with the highest word frequency in the corpus to be processed and replace the other candidate words in the group with that highest-frequency candidate word, to obtain the new corpus. Because screening has been performed in advance, the amount of data that the second synonym group determination subunit 4041 and the second replacement subunit 4042 must process during synonym replacement is greatly reduced, and execution efficiency is improved.

In a specific implementation, the statistics subunit 4051 is adapted to perform statistics on the new corpus, to obtain the word frequency and position information of the new corpus within the corpus to be processed; the extraction subunit 4052 is adapted to input the new corpus, together with its word frequency and position information, into the TextRank algorithm, to perform keyword extraction on the corpus to be processed.

In this embodiment, the verification unit 406 is adapted to perform accuracy verification on the one or more extracted keywords, to obtain a verification result; the adjustment unit 407 is adapted to adjust the parameters of the TextRank algorithm according to the verification result; and the extraction unit 408 is adapted to extract the keywords again using the TextRank algorithm with the adjusted parameters, until the verification result for the keywords meets a preset requirement. By verifying the keyword extraction result and then adjusting the parameters of the TextRank algorithm, this embodiment of the present invention further improves the accuracy of keyword extraction performed with the adjusted TextRank algorithm, so that the adjusted TextRank algorithm can be applied in practical application scenarios.

An embodiment of the present invention further discloses a terminal, which may include the information processing device 30 shown in Fig. 3 or the information processing device 40 shown in Fig. 4. The information processing device 30 or the information processing device 40 may be integrated inside the terminal or coupled to the terminal externally. The terminal may be a robot, a smartphone, a tablet device, or the like.

A person of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc, or the like.

Although the present invention is disclosed as above, the present invention is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall therefore be as defined by the claims.

Claims

1. An information processing method, characterized by comprising:

performing word segmentation on a corpus to be processed, to obtain a plurality of words;

performing synonym replacement on at least a portion of the plurality of words, to obtain a new corpus;

performing keyword extraction processing on the new corpus, to obtain one or more keywords.

2. The information processing method according to claim 1, characterized in that performing synonym replacement on at least a portion of the plurality of words comprises:

determining, according to a preset thesaurus, at least one synonym group within at least a portion of the plurality of words, wherein synonymous words are placed in the same synonym group;

for each synonym group, selecting the word with the highest word frequency in the corpus to be processed, and replacing the other words with the word with the highest word frequency, to obtain the new corpus.

3. The information processing method according to claim 1, characterized in that, after performing synonym replacement on at least a portion of the plurality of words, the method further comprises:

performing screening processing on the new corpus, for use in keyword extraction processing.

4. The information processing method according to claim 1, characterized in that, before performing synonym replacement on at least a portion of the plurality of words, the method further comprises:

performing screening processing on the plurality of words, to obtain a plurality of candidate words.

5. The information processing method according to claim 4, characterized in that performing synonym replacement on at least a portion of the plurality of words comprises:

determining, according to a preset thesaurus, at least one synonym group among the plurality of candidate words, wherein synonymous candidate words are placed in the same synonym group;

for each synonym group, selecting the candidate word with the highest word frequency in the corpus to be processed, and replacing the other candidate words with the candidate word with the highest word frequency, to obtain the new corpus.

6. The information processing method according to claim 3 or 4, characterized in that the screening processing is performed in one or more of the following ways:

screening by part of speech, retaining nouns, adjectives, and verbs;

screening by frequency, retaining words whose frequency is greater than a frequency threshold.

7. The information processing method according to claim 1, characterized in that performing keyword extraction processing on the new corpus comprises:

performing statistics on the new corpus, to obtain the word frequency and position information of the new corpus within the corpus to be processed;

inputting the new corpus and its word frequency and position information into the TextRank algorithm, to perform keyword extraction on the corpus to be processed.

8. The information processing method according to claim 7, characterized by further comprising:

performing accuracy verification on the one or more extracted keywords, to obtain a verification result;

adjusting the parameters of the TextRank algorithm according to the verification result;

extracting the keywords again using the TextRank algorithm with the adjusted parameters, until the verification result for the keywords meets a preset requirement.

9. The information processing method according to claim 1, characterized in that, before performing word segmentation on the corpus to be processed, the method further comprises:

preprocessing the corpus to be processed, to obtain a corpus to be processed in a uniform format.

10. The information processing method according to claim 9, characterized in that preprocessing the corpus to be processed comprises:

converting the corpus to be processed into text format, to obtain text data;

filtering preset words out of the text data, wherein the preset words are one or more of the following: profanity, sensitive words, and stop words;

dividing the filtered text data according to punctuation.

11. The information processing method according to claim 1, characterized in that the corpus to be processed is segmented using one or more of the following:

dictionary-based bidirectional maximum matching, the Viterbi algorithm, the HMM algorithm, and the CRF algorithm.

12. An information processing device, characterized by comprising:

a word segmentation unit, adapted to perform word segmentation on a corpus to be processed, to obtain a plurality of words;

a synonym replacement unit, adapted to perform synonym replacement on at least a portion of the plurality of words, to obtain a new corpus;

a keyword extraction unit, adapted to perform keyword extraction processing on the new corpus, to obtain one or more keywords.

13. The information processing device according to claim 12, characterized in that the synonym replacement unit comprises:

a first synonym group determination subunit, adapted to determine, according to a preset thesaurus, at least one synonym group within at least a portion of the plurality of words, wherein synonymous words are placed in the same synonym group;

a first replacement subunit, adapted to, for each synonym group, select the word with the highest word frequency in the corpus to be processed, and replace the other words with the word with the highest word frequency, to obtain the new corpus.

14. The information processing device according to claim 12, characterized by further comprising:

a first screening unit, adapted to perform screening processing on the new corpus, for use in keyword extraction processing.

15. The information processing device according to claim 12, characterized by further comprising:

a second screening unit, adapted to perform screening processing on the plurality of words, to obtain a plurality of candidate words.

16. The information processing device according to claim 15, characterized in that the synonym replacement unit comprises:

a second synonym group determination subunit, adapted to determine, according to a preset thesaurus, at least one synonym group among the plurality of candidate words, wherein synonymous candidate words are placed in the same synonym group;

a second replacement subunit, adapted to, for each synonym group, select the candidate word with the highest word frequency in the corpus to be processed, and replace the other candidate words with the candidate word with the highest word frequency, to obtain the new corpus.

17. The information processing device according to claim 14 or 15, characterized in that the screening processing is performed in one or more of the following ways:

screening by part of speech, retaining nouns, adjectives, and verbs;

18. The information processing device according to claim 12, characterized in that the keyword extraction unit comprises:

a statistics subunit, adapted to perform statistics on the new corpus, to obtain the word frequency and position information of the new corpus within the corpus to be processed;

an extraction subunit, adapted to input the new corpus and its word frequency and position information into the TextRank algorithm, to perform keyword extraction on the corpus to be processed.

19. The information processing device according to claim 18, characterized by further comprising:

a verification unit, adapted to perform accuracy verification on the one or more extracted keywords, to obtain a verification result;

an adjustment unit, adapted to adjust the parameters of the TextRank algorithm according to the verification result;

an extraction unit, adapted to extract the keywords again using the TextRank algorithm with the adjusted parameters, until the verification result for the keywords meets a preset requirement.

20. The information processing device according to claim 12, characterized by further comprising:

a preprocessing unit, adapted to preprocess the corpus to be processed, to obtain a corpus to be processed in a uniform format.

21. The information processing device according to claim 20, characterized in that the preprocessing unit comprises:

a format conversion subunit, adapted to convert the corpus to be processed into text format, to obtain text data;

a filtering subunit, adapted to filter preset words out of the text data, wherein the preset words are one or more of the following: profanity, sensitive words, and stop words;

a division subunit, adapted to divide the filtered text data according to punctuation.

22. The information processing device according to claim 12, characterized in that the word segmentation unit segments the corpus to be processed using one or more of the following:

23. A terminal, characterized by comprising the information processing device according to any one of claims 12 to 22.