CN109753656A - Data processing method, device and storage medium - Google Patents
Data processing method, device and storage medium
- Publication number
- CN109753656A, CN201811634468.0A
- Authority
- CN
- China
- Prior art keywords
- corpus
- text
- corpus text
- topic
- word segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
Embodiments of the present invention provide a data processing method, a data processing device and a storage medium. The method comprises: extracting a corpus text from a corpus; performing word segmentation on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text; screening out a keyword based on the processed corpus text, the keyword being a segmented word in the corpus text that serves as the correct answer and is mapped to the question stem; and generating a question from the processed corpus text based on the keyword.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a data processing method, device and storage medium.
Background art
With the development of the mobile internet, question-answering applications are increasingly common, and such applications need a question database that is scientific and objective. In the related art, the questions in the question databases of the various systems and platforms are mainly compiled by hand, and manual compilation requires matching question stems to answers manually. On the one hand, stems and answers are easily mismatched and answers are easily omitted; on the other hand, manual matching is extremely inefficient.
Therefore, a technical solution that can generate questions automatically is urgently needed.
Summary of the invention
In view of this, embodiments of the present invention provide a data processing method, device and storage medium for generating questions automatically.
An embodiment of the present invention provides a data processing method, comprising:
extracting a corpus text from a corpus;
performing word segmentation on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text;
screening out a keyword based on the processed corpus text, wherein the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to the question stem; and
generating a question from the processed corpus text based on the keyword.
In the above solution, before the corpus text is extracted from the corpus, the method further comprises:
obtaining original corpus texts from specific websites;
filtering the original corpus texts based on preset rules to obtain valid corpus texts; and
establishing the corpus from the valid corpus texts.
In the above solution, filtering the original corpus texts based on preset rules to obtain valid corpus texts comprises:
screening the original corpus texts according to a preset corpus integrity rule to obtain screened corpus texts; and
performing character recognition on the screened corpus texts to obtain the valid corpus texts.
In the above solution, the method further comprises:
segmenting the corpus texts in the corpus with a word segmentation algorithm to obtain a word segmentation result; and
establishing the first word segmentation dictionary from the word segmentation result in combination with a preset word segmentation dictionary.
In the above solution, the method further comprises:
when it is determined that the question is not already stored in a question database, saving the question to the question database.
An embodiment of the present invention also provides a data processing device, comprising:
an extraction unit, configured to extract a corpus text from a corpus;
a word segmentation unit, configured to perform word segmentation on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text;
a screening unit, configured to screen out a keyword based on the processed corpus text, wherein the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to the question stem; and
a first generation unit, configured to generate a question from the processed corpus text based on the keyword.
In the above solution, the device further comprises a first creating unit, configured to:
obtain original corpus texts from specific websites;
filter the original corpus texts based on preset rules to obtain valid corpus texts; and
establish the corpus from the valid corpus texts.
In the above solution, the device further comprises a second creating unit, configured to:
segment the corpus texts in the corpus with a word segmentation algorithm to obtain a word segmentation result; and
establish the first word segmentation dictionary from the word segmentation result in combination with a preset word segmentation dictionary.
An embodiment of the present invention further provides a data processing device, comprising a processor and a memory configured to store a computer program that can run on the processor;
wherein the processor is configured to perform the steps of any of the above methods when running the computer program.
An embodiment of the present invention also provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of any of the above methods.
With the data processing method, device and storage medium provided by the embodiments of the present invention, a corpus text is extracted from a corpus; word segmentation is performed on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text; a keyword is screened out based on the processed corpus text, the keyword being a segmented word in the corpus text that serves as the correct answer and is mapped to the question stem; and a question is generated from the processed corpus text based on the keyword. In the embodiments of the present invention, the mapping between stem and answer is established automatically through the screened keyword, and this mapping is accurate and reliable, so the probability of stem-answer mismatches and omitted answers is effectively reduced. At the same time, the method generates questions automatically and, per unit of time, is clearly more efficient than manual compilation, which saves time and labor costs.
Brief description of the drawings
Fig. 1 is a first schematic flowchart of the data processing method according to an embodiment of the present invention;
Fig. 2 is a second schematic flowchart of the data processing method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the question generation method according to an application embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the data processing device according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the hardware structure of the data processing device according to an embodiment of the present invention.
Detailed description of the embodiments
For a fuller understanding of the features and technical content of the embodiments of the present invention, their implementation is described in detail below with reference to the accompanying drawings. The drawings are for reference and illustration only and are not intended to limit the present invention.
In the related art, the systems and platforms build their question databases mainly by compiling questions manually. This has the following main drawbacks:
1. Although many platforms and applications can generate questions at random, they essentially select at random from an existing question database, which is itself still mainly compiled by hand.
2. The quality of the question database is closely tied to the professional level of the compilers and to the material they have access to, so the scientific rigor and objectivity of the questions cannot be guaranteed.
3. Manual compilation is affected by subjective factors; the probability of errors and omissions is high and standards are inconsistent, which places a heavy burden on later review and consumes considerable manpower, material and financial resources.
4. Building a question database involves many stages, from data collection through sorting, compilation, classification and review to storage, and each hand-off of the data may cause data loss, data contamination or tampering.
5. The degree of customization is low: for a new requirement, the staff need time to learn, and the whole process from data collection to storage must be repeated, which is inefficient.
6. The question database cannot be generated continuously, which limits the growth of its capacity and the improvement of its quality.
On this basis, in various embodiments of the present invention, questions are generated automatically from massive data, and a question database is generated from them.
The data processing method provided by the embodiments of the present invention can generate questions automatically and, in turn, generate a question database automatically.
The questions referred to here are objective question types, which may include multiple-choice questions, true/false questions, fill-in-the-blank questions and the like, where multiple-choice questions may have more than one correct option.
An embodiment of the present invention provides a data processing method. Fig. 1 is a schematic flowchart of the data processing method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S101: extract a corpus text from a corpus.
Here, in practice, the data stored in the corpus is language material that has actually occurred in real language use. A corpus is a basic resource that carries linguistic knowledge, with the electronic computer as its carrier. The corpus texts in the corpus must be analyzed and processed before they become a useful resource.
Here, the extracted corpus text may be a corpus text drawn at random from the corpus.
Step S102: perform word segmentation on the corpus text according to the first word segmentation dictionary to obtain a processed corpus text.
Here, the first word segmentation dictionary is a word segmentation dictionary different from the preset word segmentation dictionary.
The first word segmentation dictionary and the preset word segmentation dictionary are described below with a practical example.
Suppose the corpus text "imitate Li Xiaopeng jump" has been crawled from an official website. With the preset word segmentation dictionary, this text is usually segmented as: imitate / Li Xiaopeng / jump. In the embodiments of the present invention, the corpus texts in the corpus are segmented with several word segmentation algorithms to obtain a word segmentation result, and the first word segmentation dictionary is established from that result in combination with the preset word segmentation dictionary. Therefore, after the corpus text is segmented according to the first word segmentation dictionary, "imitate Li Xiaopeng jump" is more likely to be segmented as: imitate / Li Xiaopeng jump.
At this point the word segmentation of the corpus text is complete, and the segmented corpus text has been obtained.
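The effect of the dictionary on the example above can be illustrated with a minimal sketch. It assumes a simple forward maximum matching over English tokens; the function name `segment`, the token list and both dictionaries are hypothetical stand-ins, since the description does not fix the algorithm used at this step.

```python
def segment(tokens, dictionary, max_len=3):
    """Forward maximum matching over a list of tokens (illustrative only)."""
    result, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; a single token is always accepted.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in dictionary:
                result.append(candidate)
                i += size
                break
    return result


tokens = ["imitate", "Li", "Xiaopeng", "jump"]

# Hypothetical dictionaries: the first dictionary adds an entry learned from the corpus.
preset_dictionary = {"imitate", "Li Xiaopeng", "jump"}
first_dictionary = preset_dictionary | {"Li Xiaopeng jump"}

preset_result = segment(tokens, preset_dictionary)  # ['imitate', 'Li Xiaopeng', 'jump']
first_result = segment(tokens, first_dictionary)    # ['imitate', 'Li Xiaopeng jump']
```

The only difference between the two runs is the extra dictionary entry, which is why a richer first word segmentation dictionary changes the segmentation.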
Step S103: screen out a keyword based on the processed corpus text, wherein the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to the question stem.
Here, the purpose of this step is to screen out, from the corpus text, the segmented word that will serve as the correct answer.
In implementation, the processed corpus text can be screened according to predefined question-setting rules to obtain the keyword. A predefined question-setting rule may define a fixed collocation pattern, for example screening out as the keyword the segmented words that stand in place of the ellipsis in the fixed collocation "taking ... as". It may also define specific content that must be present, for example screening out as the keyword the segmented words in the corpus text that contain numbers. The question-setting rules can be chosen flexibly according to the actual application and are not specifically limited here.
Step S104: generate a question from the processed corpus text based on the keyword.
Here, because different question types have different structures, different extraction rules are needed to turn the processed corpus text into a question.
For example, if the question type is fill-in-the-blank, the keyword is replaced with a designated symbol (such as brackets) to generate the stem, and the keyword is mapped to the stem as its answer. The generated stem, the answer, and the mapping between them constitute the question.
If the question type is multiple-choice, then in addition to replacing the keyword with a designated symbol (such as brackets), alternative wrong options must be generated. The wrong options may be generated by, for example, applying a certain offset to the correct answer (the keyword) or replacing it with a similar word, which is not specifically limited here. The generated stem, the multiple options, and the mapping between the stem and the options constitute the question.
If the question type is true/false: for a question whose correct judgment is "true", no keyword replacement is needed and the stem is obtained directly from the processed corpus text; for a question whose correct judgment is "false", a wrong answer must additionally be generated and substituted for the keyword to obtain the stem. The wrong answer may be generated by, for example, applying a certain offset to the correct answer (the keyword) or replacing it with a similar word, which is not specifically limited here. The number of keyword replacements is recorded, and the answer to the true/false question is set accordingly: when the number of replacements is 0, the answer is "true"; when it is not 0, the answer is "false". The generated stem, the answer, and the mapping between them constitute the question.
Questions of different types are thus generated from the corpus text by applying different extraction rules.
In the data processing method provided by the embodiments of the present invention, a corpus text is extracted from a corpus; word segmentation is performed on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text; a keyword is screened out based on the processed corpus text, the keyword being a segmented word in the corpus text that serves as the correct answer and is mapped to the question stem; and a question is generated from the processed corpus text based on the keyword. In the embodiments of the present invention, the mapping between stem and answer is established automatically through the screened keyword, and this mapping is accurate and reliable, so the probability of stem-answer mismatches and omitted answers is effectively reduced. At the same time, because the method generates questions automatically, it is clearly more efficient per unit of time than manual compilation, which saves time and labor costs.
An embodiment of the present invention provides another data processing method. Fig. 2 is a schematic flowchart of this other data processing method. As shown in Fig. 2, the method comprises the following steps:
Step S201: establish a corpus.
Here, the corpus is the basis for establishing the first word segmentation dictionary and for generating questions. In the embodiments of the present invention, the corpus may be established by the following steps:
Step a: obtain original corpus texts from specific websites.
Here, because the embodiments of the present invention deal with massive data, the data sources are restricted to specific media in order to guarantee the reliability and standardization of the data. These specific media mainly include: online news, blogs, papers, reports, tutorials and literary works published by the official and authoritative websites and platforms of various industries and professional fields; laws, regulations, cases and other data published on the online platforms of national official and authoritative institutions; and offline electronic documents in various formats.
Here, the original corpus texts may be obtained from the specific websites by a crawler program.
Step b: filter the original corpus texts based on preset rules to obtain valid corpus texts.
Here, the preset rules include a preset corpus integrity rule and character recognition. The original corpus texts are first screened according to the preset corpus integrity rule; character recognition is then performed on the screened corpus texts, and those containing unrecognizable characters are filtered out, leaving the valid corpus texts.
Here, the preset corpus integrity rule means that integrity constraints which the corpus must satisfy are set in advance, for example that no primary attribute in a relation may be null; these constraints can be adjusted to actual needs.
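The two-stage filter in step b can be sketched as follows. This is only an illustration under stated assumptions: the record fields in `REQUIRED_FIELDS` are hypothetical, and "unrecognizable characters" is approximated here as the Unicode replacement character or non-printable control characters, which the description does not specify.

```python
REQUIRED_FIELDS = ("title", "source", "body")  # hypothetical primary attributes


def is_complete(record):
    """Integrity rule: every required attribute must be present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def has_unrecognizable_chars(text):
    """Assumed check: replacement character or non-printable control characters."""
    return any(
        ch == "\ufffd" or (not ch.isprintable() and ch not in "\n\t")
        for ch in text
    )


def filter_corpus(records):
    """Apply the integrity rule first, then the character check."""
    screened = [r for r in records if is_complete(r)]
    return [r for r in screened if not has_unrecognizable_chars(r["body"])]


raw_records = [
    {"title": "Festival", "source": "example.org", "body": "The Dragon Boat Festival..."},
    {"title": "No source", "source": None, "body": "Incomplete record"},
    {"title": "Garbled", "source": "example.org", "body": "Bad \ufffd text"},
]
valid_records = filter_corpus(raw_records)
```

Only the first record survives: the second fails the integrity rule and the third contains a character treated as unrecognizable.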
Step c: establish the corpus from the valid corpus texts.
The valid corpus texts are saved to the corpus.
At this point the corpus has been established.
In practice, new valid corpus texts are continually saved to the corpus, so the corpus is continuously updated and enriched.
Step S202: establish the first word segmentation dictionary.
Here, the first word segmentation dictionary is the basis on which the embodiments of the present invention perform word segmentation, and its richness directly affects the accuracy of the segmentation. In the embodiments of the present invention, the first word segmentation dictionary may be established by the following steps:
Step a: segment the corpus texts in the corpus with a word segmentation algorithm to obtain a word segmentation result.
The word segmentation algorithms here include algorithms based on string matching, algorithms based on understanding, algorithms based on statistics, and so on; the algorithms can be used alone or in combination.
For example, the corpus texts in the corpus can first be preliminarily segmented with the reverse maximum matching method (RMM) and the maximum matching method (MM), both of which are string-matching-based word segmentation algorithms. The probability with which words and collocation patterns occur in the corpus texts can then be counted (a statistics-based word segmentation method), and the word segmentation result is obtained from these statistics.
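A minimal sketch of this combination follows, assuming character-level matching on an abstract string. The tie-breaking rule in `pick_by_frequency` (highest average corpus frequency per segment) is a crude illustrative assumption, not the statistic used in the description, and all names and data are hypothetical.

```python
def forward_mm(text, dictionary, max_len):
    """Maximum matching (MM): scan left to right, longest match first."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                result.append(piece)
                i += size
                break
    return result


def reverse_mm(text, dictionary, max_len):
    """Reverse maximum matching (RMM): scan right to left, longest match first."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                result.append(piece)
                j -= size
                break
    return result[::-1]


def pick_by_frequency(candidates, freq):
    """Assumed statistic: prefer the segmentation with the higher mean word frequency."""
    return max(candidates, key=lambda seg: sum(freq.get(w, 0) for w in seg) / len(seg))


dictionary = {"ab", "abc", "cd"}
freq = {"ab": 5, "cd": 4, "abc": 1}  # hypothetical corpus counts

mm_result = forward_mm("abcd", dictionary, max_len=3)   # ['abc', 'd']
rmm_result = reverse_mm("abcd", dictionary, max_len=3)  # ['ab', 'cd']
chosen = pick_by_frequency([mm_result, rmm_result], freq)
```

When MM and RMM disagree, as they do here, the corpus statistics decide between the two candidate segmentations.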
Step b: establish the first word segmentation dictionary from the word segmentation result in combination with a preset word segmentation dictionary.
A preset word segmentation dictionary here is a dictionary that has already been produced with existing word segmentation algorithms and can be called by users at any time. Current mainstream preset word segmentation dictionaries include jieba (about 166,000 words), IK (about 275,000 words), mmseg (about 150,000 words) and word (about 275,000 words). These preset dictionaries are generally integrated into different programming language environments as functional components. Combining the result with a preset word segmentation dictionary both enriches the dictionary and reduces the computational load of creating the first word segmentation dictionary.
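One way to picture step b is as a union of corpus-derived words and a preset dictionary. The sketch below assumes a simple frequency threshold (`min_count`) for admitting corpus-derived words; that threshold, the function name and the sample data are illustrative assumptions rather than part of the described method.

```python
from collections import Counter


def build_first_dictionary(segmented_texts, preset_dictionary, min_count=2):
    """Merge frequent corpus-derived words with a preset dictionary."""
    counts = Counter(word for text in segmented_texts for word in text)
    # Keep only words seen often enough in the corpus segmentation result.
    corpus_words = {word for word, count in counts.items() if count >= min_count}
    return corpus_words | set(preset_dictionary)


# Hypothetical segmentation results for two corpus texts.
segmented_texts = [
    ["imitate", "Li Xiaopeng jump"],
    ["watch", "Li Xiaopeng jump"],
]
preset_dictionary = {"imitate", "Li Xiaopeng", "jump"}

first_dictionary = build_first_dictionary(segmented_texts, preset_dictionary)
```

Starting from the preset dictionary means only the corpus-specific entries have to be learned, which matches the reduced computational load mentioned above.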
At this point the first word segmentation dictionary has been established.
In practice, after the first word segmentation dictionary is established, newly obtained segmented words can be added to it, so the first word segmentation dictionary is continuously updated and enriched.
Because new words keep emerging and are used frequently for a period after they appear, new words can be specially marked to increase their influence, which in turn improves the efficiency of the subsequent word segmentation of the corpus text according to the first word segmentation dictionary.
Step S203: extract a corpus text from the corpus.
Here, in practice, the data stored in the corpus is language material that has actually occurred in real language use. A corpus is a basic resource that carries linguistic knowledge, with the electronic computer as its carrier. The corpus texts in the corpus must be analyzed and processed before they become a useful resource.
Here, extracting the corpus text from a well-established corpus improves the quality of the corpus text, which facilitates the subsequent generation of questions and the building of the question database.
For example, suppose the following corpus text is extracted from the corpus: "The Dragon Boat Festival, the Spring Festival, the Qingming Festival and the Mid-Autumn Festival are known as the four major traditional Chinese folk festivals. Dragon Boat Festival culture has a wide influence around the world, and some countries and regions also have the custom of celebrating the Dragon Boat Festival. In 2006, the State Council included the Dragon Boat Festival in the first batch of the National Intangible Cultural Heritage List. The origin of the Dragon Boat Festival covers ancient astrological culture, humanistic philosophy and other content, and carries deep and rich cultural connotations. In the field of folk culture, the Chinese people take eating zongzi and dragon boat racing as the two major traditional ritual themes of the Dragon Boat Festival."
Step S204: perform word segmentation on the corpus text according to the first word segmentation dictionary to obtain a processed corpus text.
Here, the first word segmentation dictionary is a word segmentation dictionary different from the preset word segmentation dictionary.
The first word segmentation dictionary and the preset word segmentation dictionary are described below with a practical example.
Suppose the corpus text "imitate Li Xiaopeng jump" has been crawled from an official website. With the preset word segmentation dictionary, this text is usually segmented as: imitate / Li Xiaopeng / jump. In the embodiments of the present invention, the corpus texts in the corpus are segmented with several word segmentation algorithms to obtain a word segmentation result, and the first word segmentation dictionary is established from that result in combination with the preset word segmentation dictionary. Therefore, after the corpus text is segmented according to the first word segmentation dictionary, "imitate Li Xiaopeng jump" is more likely to be segmented as: imitate / Li Xiaopeng jump.
In a specific implementation, the text can be decomposed into a number of text segments with a 2-gram segmentation instruction in Python (a computer programming language).
At this point the word segmentation of the corpus text is complete, and the segmented corpus text has been obtained.
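The description does not name the Python instruction or library used for 2-gram segmentation, so the following is only a plain-slicing sketch of what a character-level 2-gram decomposition produces; the function name `ngrams` is hypothetical.

```python
def ngrams(text, n=2):
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = ngrams("abcd")       # ['ab', 'bc', 'cd']
too_short = ngrams("a")        # [] because the text is shorter than n
```

For Chinese text, where words are not separated by spaces, each 2-gram is a candidate two-character segment that can then be checked against the dictionary.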
Step S205: screen out a keyword based on the processed corpus text, wherein the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to the question stem.
Here, the purpose of this step is to screen out, from the corpus text, the segmented word that will serve as the correct answer.
In implementation, the processed corpus text can be screened according to predefined question-setting rules to obtain the keyword. A predefined question-setting rule may define a fixed collocation pattern, for example screening out as the keyword the segmented words that stand in place of the ellipsis in the fixed collocation "taking ... as". It may also define specific content that must be present, for example screening out as the keyword the segmented words in the corpus text that contain numbers. The question-setting rules can be chosen flexibly according to the actual application and are not specifically limited here.
The keyword screening process is described in more detail below by way of example.
Taking the above example, the extracted corpus text is: "The Dragon Boat Festival, the Spring Festival, the Qingming Festival and the Mid-Autumn Festival are known as the four major traditional Chinese folk festivals. Dragon Boat Festival culture has a wide influence around the world, and some countries and regions also have the custom of celebrating the Dragon Boat Festival. In 2006, the State Council included the Dragon Boat Festival in the first batch of the National Intangible Cultural Heritage List. The origin of the Dragon Boat Festival covers ancient astrological culture, humanistic philosophy and other content, and carries deep and rich cultural connotations. In the field of folk culture, the Chinese people take eating zongzi and dragon boat racing as the two major traditional ritual themes of the Dragon Boat Festival." This corpus text is processed as follows:
1. If the question-setting rule requires specific content (one of numbers, dates and times, places, etc.), the keyword "2006" can be screened out.
2. If the question-setting rule defines a fixed collocation pattern (one of the "taking ... as" pattern, the quotation-mark pattern, the "...; ...; ..." pattern, etc.), the keyword "eating zongzi, dragon boat racing" can be screened out.
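The two rules in this example can be approximated with regular expressions. The sketch below works on English paraphrases of two sentences from the corpus text, so the patterns (`\d+` for numbers, `take ... as` for the fixed collocation) are illustrative assumptions; a real implementation would operate on the segmented Chinese text.

```python
import re


def screen_keywords(sentence):
    """Apply two illustrative question-setting rules to one sentence."""
    keywords = []
    # Rule 1: content that contains digits (for example a year).
    keywords += re.findall(r"\b\d+\b", sentence)
    # Rule 2: the fixed collocation "take ... as"; the captured part is the keyword.
    keywords += re.findall(r"\btake (.+?) as\b", sentence)
    return keywords


year_sentence = (
    "In 2006, the State Council included the Dragon Boat Festival in the "
    "first batch of the National Intangible Cultural Heritage List."
)
custom_sentence = (
    "The Chinese people take eating zongzi and dragon boat racing as the "
    "two major traditional themes of the Dragon Boat Festival."
)

year_keywords = screen_keywords(year_sentence)      # ['2006']
custom_keywords = screen_keywords(custom_sentence)  # ['eating zongzi and dragon boat racing']
```

Each rule is independent, so further question-setting rules can be added as extra patterns without changing the existing ones.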
Step S206: generate a question from the processed corpus text based on the keyword.
Here, because different question types have different structures, different extraction rules are needed to turn the processed corpus text into a question.
For example, if the question type is fill-in-the-blank, the keyword is replaced with a designated symbol (such as brackets) to generate the stem, and the keyword is mapped to the stem as its answer. The generated stem, the answer, and the mapping between them constitute the question.
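A fill-in-the-blank question of this kind can be sketched in a few lines. The dictionary layout of the returned question and the blank marker are assumptions made for illustration; the sentence is an English paraphrase of the corpus text.

```python
def make_fill_in_blank(sentence, keyword, blank="(    )"):
    """Replace the keyword with a blank and keep the keyword as the answer."""
    if keyword not in sentence:
        raise ValueError("keyword does not occur in the sentence")
    stem = sentence.replace(keyword, blank, 1)
    # The stem/answer pair in one record is the stem-to-answer mapping.
    return {"type": "fill_in_blank", "stem": stem, "answer": keyword}


sentence = "In 2006, the State Council included the Dragon Boat Festival in the list."
question = make_fill_in_blank(sentence, "2006")
```

Because the answer is taken from the same sentence that produces the stem, the stem and its answer cannot be mismatched.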
If the question type is multiple-choice, then in addition to replacing the keyword with a designated symbol (such as brackets), alternative wrong options must be generated. The wrong options may be generated by, for example, applying a certain offset to the correct answer (the keyword) or replacing it with a similar word, which is not specifically limited here. The generated stem, the multiple options, and the mapping between the stem and the options constitute the question.
Here further detailed description is carried out to the process for generating alternative wrong option by way of example.For above-mentioned example
Son proceeds as follows:
If 1, keyword is numeric type, according to the figure pattern of correct option, a certain amount of offset is done, according to setting
Quantity generates other alternate items, as " 2006 " as answer and stem to establish mapping outer in example, and is labeled as correct option, together
Shi Shengcheng " 2005 ", " 2007 " and " 2016 " alternately wrong option, and correct option and alternative wrong option with topic
It is dry to establish mapping relations.
If 2, keyword be such as " food Zongzi, dragon-boat racing " a kind of keyword without obvious mode, can be by corpus
The method of search key match pattern generates other alternative options in library, such as " food Zongzi, dragon-boat racing " can corpus its
The similar words such as " food the rice dumpling, dragon-boat racing ", " food Zongzi, admire the full moon bright " are retrieved in his corpus text, using conduct after verification
Alternative wrong option and correct option establish mapping relations with stem together.
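The numeric case (item 1) can be illustrated with a short Python sketch. It assumes a plain additive offset within a small range, which is a simplification: the patent's own example also produces "2016", which suggests digit-level changes that are not modelled here. The function names and the sample stem are hypothetical.

```python
import random
from typing import List, Optional


def numeric_distractors(answer: str, count: int = 3, max_offset: int = 10,
                        seed: Optional[int] = None) -> List[str]:
    """Return `count` distinct wrong options by offsetting a numeric answer."""
    rng = random.Random(seed)
    value = int(answer)
    # Zero is excluded so that the correct answer is never produced.
    offsets = [d for d in range(-max_offset, max_offset + 1) if d != 0]
    return [str(value + d) for d in rng.sample(offsets, count)]


def make_multiple_choice(stem: str, answer: str, distractors: List[str],
                         seed: Optional[int] = None) -> dict:
    """Shuffle the correct option in among the wrong ones and keep the mapping."""
    options = [answer, *distractors]
    random.Random(seed).shuffle(options)
    return {"stem": stem, "options": options, "answer": answer}


wrong = numeric_distractors("2006", count=3, seed=0)
question = make_multiple_choice("The regulation took effect in ( ).", "2006", wrong, seed=0)
print(question["options"])
```

For keywords without an obvious pattern (item 2), the wrong options would come from a corpus search and a verification step instead, which this sketch does not attempt.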
If the question to be generated is a true/false question, two cases apply. For a question whose judgment is "true", no keyword replacement is needed, and the stem is taken directly from the processed corpus text. For a question whose judgment is "false", a wrong answer must be generated and substituted for the keyword to obtain the stem; the wrong answer may be generated by applying a certain offset to the correct answer (the keyword) or by substituting similar words, and no specific limitation is imposed here. The number of keyword replacements is recorded and used to determine the answer of the true/false question: when the number of replacements is 0, the answer is "true"; when it is not 0, the answer is "false". The generated stem, the answer, and the mapping between the stem and the answer constitute the question.
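The replacement-count rule for true/false questions can be sketched as follows. The function name, the dictionary layout and the sample sentence are assumptions made for illustration; the only behaviour taken from the text is that zero replacements yields "true" and any replacement yields "false".

```python
from typing import Optional


def make_true_false(sentence: str, keyword: str,
                    wrong_answer: Optional[str] = None) -> dict:
    """Build a true/false question and derive its answer from the replacement count."""
    stem = sentence
    replacements = 0
    if wrong_answer is not None:
        stem = sentence.replace(keyword, wrong_answer, 1)
        # Count the replacement only if the stem actually changed.
        replacements = 1 if stem != sentence else 0
    return {"stem": stem, "replacements": replacements, "answer": replacements == 0}


source = "The regulation took effect in 2006."  # hypothetical corpus sentence
true_q = make_true_false(source, "2006")
false_q = make_true_false(source, "2006", wrong_answer="2005")
print(true_q["answer"], false_q["answer"])  # True False
```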
In this way, questions of different types are generated from the corpus text by applying different extraction rules.
Step S207: generate a question database based on the generated questions.
When it is determined that the question database does not already store the question, the question is saved to the question database.
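A minimal sketch of "save only if not already stored", using Python's standard `sqlite3` module. The table layout and the function names are assumptions; the patent does not specify how the question database is implemented or how two questions are judged to be the same (here, an identical stem and answer).

```python
import sqlite3


def open_question_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a question database with a single hypothetical table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS questions ("
        "stem TEXT NOT NULL, answer TEXT NOT NULL, UNIQUE (stem, answer))"
    )
    return conn


def save_if_new(conn: sqlite3.Connection, stem: str, answer: str) -> bool:
    """Store the question unless an identical one exists; return True if stored."""
    exists = conn.execute(
        "SELECT 1 FROM questions WHERE stem = ? AND answer = ?", (stem, answer)
    ).fetchone()
    if exists:
        return False
    conn.execute("INSERT INTO questions (stem, answer) VALUES (?, ?)", (stem, answer))
    conn.commit()
    return True


conn = open_question_db()
first = save_if_new(conn, "The regulation took effect in ( ).", "2006")
second = save_if_new(conn, "The regulation took effect in ( ).", "2006")
print(first, second)  # True False
```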
In one embodiment, the generated question also undergoes processing such as data cleaning, and only the processed question is saved to the question database; the question database can also be corrected and improved according to feedback from third-party applications and platforms.
Here, in practical application, the data cleaning may include:
Performing a semantic review of the generated question and, for a question that passes the semantic review, checking its integrity and correctness.
The data processing method provided by the embodiment of the present invention first establishes a corpus and a first word segmentation dictionary, and then extracts corpus text from the corpus; performs word segmentation on the corpus text according to the first word segmentation dictionary to obtain processed corpus text; screens out a keyword based on the processed corpus text, where the keyword is the segment in the corpus text that serves as the correct answer and is mapped to the stem; and, based on the keyword, generates a question from the processed corpus text. Finally, a question database is generated from the generated questions. In the embodiment of the present invention, the screened keyword is used to establish the mapping between stem and answer automatically, and this mapping is accurate and reliable, so the probability of stem-answer mismatches and of missing answers is effectively reduced. At the same time, the method generates questions automatically; per unit of time it is clearly more efficient than writing questions by hand, which saves time and labor costs.
In addition, in the embodiment of the present invention, the data of the established corpus comes from specific websites, such as material published on official platforms in a professional field. This ensures the reliability and standardization of the data source, so the generated questions and question database are more rigorous and authoritative. The established first word segmentation dictionary provides a better reference for segmentation than the preset word segmentation dictionary, so the word segmentation is more accurate, which in turn makes the generated questions and question database more accurate.
Furthermore, the question database can be corrected according to feedback from third-party applications and platforms; that is, the question database remains in continuous operation, so it can be iterated and updated promptly and stays current.
The present invention is described in further detail below with reference to an application embodiment.
The application embodiment of the present invention provides a question generation method. Fig. 3 is a schematic flowchart of the question generation method of the application embodiment. As shown in Fig. 3, the method includes the following steps:
Step S301: collect data according to preset rules.
Here, the collected data is selected from material published by official platforms and authoritative institutions, to ensure the reliability and standardization of the data, and a crawler program crawls and collects data from the target data sources.
Here, the preset rules are rules prepared in advance in a unified standard format. The rules support basic operations such as adding, deleting, modifying and querying, and can be arranged and combined as required. By setting rules, targeted data can be collected.
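One way to picture rules that share a unified format, support add/delete/modify/query, and can be combined is a small registry of named predicates. This is only a sketch: the patent does not describe the rule format, and `Rule`, `RuleSet` and the sample rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class Rule:
    name: str
    predicate: Callable[[str], bool]  # True when a text should be collected


class RuleSet:
    """Registry of named rules with add/delete/modify/query and combination."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def add(self, rule: Rule) -> None:
        self._rules[rule.name] = rule

    def delete(self, name: str) -> None:
        self._rules.pop(name, None)

    def modify(self, name: str, predicate: Callable[[str], bool]) -> None:
        self._rules[name] = Rule(name, predicate)

    def query(self, name: str) -> Optional[Rule]:
        return self._rules.get(name)

    def combine(self, names: Iterable[str]) -> Callable[[str], bool]:
        """Return a predicate that accepts a text only if every named rule does."""
        selected = [self._rules[n].predicate for n in names]
        return lambda text: all(p(text) for p in selected)


rules = RuleSet()
rules.add(Rule("mentions_festival", lambda t: "festival" in t.lower()))
rules.add(Rule("min_length", lambda t: len(t) >= 20))
accept = rules.combine(["mentions_festival", "min_length"])
print(accept("The Dragon Boat Festival is a traditional holiday."))  # True
```

Combining rules with a logical AND is one design choice; other combinations (OR, ordered pipelines) would fit the same registry.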
Step S302: filter the collected data according to preset rules to obtain filtered data.
Here, the collected data is first filtered according to the relevant national laws and regulations and an integrity rule, removing sensitive corpus text that is unsuitable for formal publication and corpus text that is incomplete. Further, character recognition can be performed on the filtered corpus text, and corpus text containing unrecognizable characters is filtered out, so as to obtain legal, usable data.
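The two filtering passes can be sketched as below. Every concrete criterion here is an assumption made for illustration: the patent does not define the integrity rule (this sketch requires sentence-ending punctuation), the sensitive-content check (here a caller-supplied block list), or what counts as an unrecognizable character (here the Unicode replacement character, control characters, and unassigned or private-use code points).

```python
import unicodedata
from typing import Iterable, List

SENTENCE_END = ("。", ".", "!", "?", "！", "？")


def is_complete(text: str) -> bool:
    """Hypothetical integrity rule: non-empty and ends with sentence punctuation."""
    stripped = text.strip()
    return bool(stripped) and stripped.endswith(SENTENCE_END)


def has_unrecognizable_chars(text: str) -> bool:
    """Flag replacement characters, control characters and unassigned code points."""
    for ch in text:
        if ch in "\n\r\t":
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in {"Cc", "Cn", "Co"}:
            return True
    return False


def filter_corpus(texts: Iterable[str], blocked_terms: Iterable[str] = ()) -> List[str]:
    """Keep only texts that pass the block list, integrity and character checks."""
    blocked = list(blocked_terms)
    kept = []
    for text in texts:
        if any(term in text for term in blocked):
            continue
        if not is_complete(text) or has_unrecognizable_chars(text):
            continue
        kept.append(text.strip())
    return kept


raw = [
    "The Dragon Boat Festival falls on the fifth day of the fifth lunar month.",
    "Truncated sentence without an ending",
    "Corrupted \ufffd text.",
    "Contains a blocked term.",
]
clean = filter_corpus(raw, blocked_terms=("blocked term",))
print(len(clean))  # 1
```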
Here, the filtered data can be used to form the corpus.
Step S303: perform word segmentation on the filtered data and verify the segmentation result to obtain the segmented data.
Here, in a specific implementation, this can be achieved by the following steps:
Step a: perform word segmentation on the filtered data using a 2-gram method in Python to obtain a segmentation result;
Step b: using regular expressions, replace segments in the obtained segmentation result with the terms used in official or authoritative publications, to ensure that the segments used in the data are consistent with those publications.
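Steps a and b can be sketched as follows. The patent only says "2-gram" and "regular expression", so the details are assumptions: step a is shown as overlapping character bigrams, and step b as a table that maps a regular-expression pattern for variant spellings to the official term. The function names and sample strings are hypothetical.

```python
import re
from typing import Dict, List


def char_bigrams(text: str) -> List[str]:
    """Step a (assumed form): split text into overlapping two-character segments."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def normalize_tokens(tokens: List[str], official_terms: Dict[str, str]) -> List[str]:
    """Step b (assumed form): replace segments that match a variant pattern."""
    compiled = [(re.compile(pattern), term) for pattern, term in official_terms.items()]
    result = []
    for token in tokens:
        for pattern, term in compiled:
            if pattern.fullmatch(token):
                token = term  # use the official spelling
                break
        result.append(token)
    return result


tokens = char_bigrams("端午节吃粽子")
print(tokens)  # ['端午', '午节', '节吃', '吃粽', '粽子']
print(normalize_tokens(["糉子", "端午"], {r"糉子|粽籽": "粽子"}))  # ['粽子', '端午']
```

Raw bigrams include many non-words (such as "节吃"), which is one reason a verification step against a dictionary of official terms is useful.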
At this point, the segmented data has been obtained.
Step S304: screen out a keyword based on the segmented data; the keyword is the segment in the corpus text that serves as the correct answer and is mapped to the stem.
Step S305: based on the keyword, generate a question from the processed corpus text in combination with morphological analysis.
Here, the generated question may be a multiple-choice question, a true/false question, a fill-in-the-blank question, or the like.
Specifically, if the question to be generated is a fill-in-the-blank question, the keyword is replaced with a designated symbol (such as brackets) to produce the stem, and a mapping is established between the keyword, as the answer, and the stem. The generated stem, the answer, and the mapping between the stem and the answer constitute the question.
If the question to be generated is a multiple-choice question or a true/false question, the keyword is replaced with a designated symbol (such as brackets) to produce the stem, and spare wrong answers must also be generated. The spare wrong answers are generated by morphological analysis, which may include applying a certain offset to the correct answer (the keyword) or substituting similar words; no specific limitation is imposed here. The generated stem, the answer, and the mapping between the stem and the answer constitute the question.
Step S306: filter the generated questions according to preset rules to obtain filtered questions.
Here, the preset rules include a semantic review, an integrity and correctness check, and duplicate detection. The generated question first undergoes the semantic review; a question that passes the semantic review then undergoes the integrity and correctness check; a question that passes that check is then compared with the previously generated questions, and duplicates are filtered out, yielding the filtered questions.
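The three-stage filter can be sketched as a chain in which each stage only sees questions that passed the previous one. The semantic review and the validity check are left as caller-supplied functions because the patent does not say how they work; the duplicate test on the stripped stem is likewise an assumption.

```python
from typing import Callable, Iterable, List, Set

Check = Callable[[dict], bool]


def filter_questions(candidates: Iterable[dict], existing_stems: Set[str],
                     semantic_check: Check, validity_check: Check) -> List[dict]:
    """Apply semantic review, then validity check, then duplicate detection."""
    seen = {stem.strip() for stem in existing_stems}  # copy; caller's set is untouched
    kept = []
    for question in candidates:
        if not semantic_check(question):
            continue
        if not validity_check(question):
            continue
        key = question["stem"].strip()
        if key in seen:
            continue  # duplicate of a stored or earlier candidate question
        seen.add(key)
        kept.append(question)
    return kept


candidates = [
    {"stem": "The regulation took effect in ( ).", "answer": "2006"},
    {"stem": "The regulation took effect in ( ).", "answer": "2006"},  # duplicate
    {"stem": "The festival is held in ( ).", "answer": ""},            # incomplete
    {"stem": "Zongzi are eaten during the ( ).", "answer": "Dragon Boat Festival"},
]
filtered = filter_questions(
    candidates,
    existing_stems=set(),
    semantic_check=lambda q: True,  # placeholder: the review method is unspecified
    validity_check=lambda q: bool(q["stem"]) and bool(q["answer"]),
)
print(len(filtered))  # 2
```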
Step S307: classify the filtered questions and save them to the question database according to the classification result.
Here, the filtered questions are first classified, a classification index is then built for them, and they are saved to the question database according to the classification result. The classification index is built to facilitate subsequent maintenance and upgrading of the question database.
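A classification index can be as simple as a mapping from category to the positions of the questions in that category. The sketch below assumes questions are classified by a `type` field; the patent does not state the classification criterion, so both the field and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def build_classification_index(questions: List[dict],
                               classify: Callable[[dict], str]) -> Dict[str, List[int]]:
    """Map each category to the list positions of the questions it contains."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, question in enumerate(questions):
        index[classify(question)].append(position)
    return dict(index)


questions = [
    {"type": "fill_in_blank", "stem": "The regulation took effect in ( )."},
    {"type": "true_false", "stem": "The regulation took effect in 2005."},
    {"type": "fill_in_blank", "stem": "Zongzi are eaten during the ( )."},
]
index = build_classification_index(questions, classify=lambda q: q["type"])
print(index)  # {'fill_in_blank': [0, 2], 'true_false': [1]}
```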
Step S308: establish a unified external interface for the question database.
Here, a unified external interface is established for the question database, and the external interface is backward compatible. Whenever the question database is published externally, this unified external interface is used, so that third-party applications and platforms can access and call it. The backward compatibility of the external interface ensures that an upgraded question database still supports normal use by third-party applications and platforms that connected before the upgrade.
Step S309: correct and improve the question database according to feedback from third-party applications and platforms.
Based on feedback from third-party applications and platforms, and after the validity of the feedback has been verified, the question database is corrected and improved.
The feedback here can be applied at step S306: after the question identified in the feedback is corrected, the subsequent steps continue.
In the data processing method provided by the embodiment of the present invention, the entire flow from data collection and question generation to classified storage is automated, which effectively avoids problems such as data loss, contamination and tampering caused by manual handling as the data moves through the flow. In addition, in the embodiment of the present invention, questions can be customized simply by setting extraction rules, yielding the question database, and complex data-extraction scenarios can be handled by combining rules. The question database has a unified external interface, which makes it easy for third-party applications and platforms to access and call.
In order to implement the method of the embodiment of the present invention, the embodiment of the present invention further provides a data processing device. Fig. 4 is a schematic structural diagram of the device of the embodiment of the present invention. As shown in Fig. 4, the device 40 includes an extraction unit 41, a word segmentation processing unit 42, a screening unit 43 and a first generation unit 44, wherein:
The extraction unit 41 is configured to extract corpus text from a corpus;
The word segmentation processing unit 42 is configured to perform word segmentation on the corpus text according to a first word segmentation dictionary to obtain processed corpus text;
The screening unit 43 is configured to screen out a keyword based on the processed corpus text; the keyword is the segment in the corpus text that serves as the correct answer and is mapped to the stem;
The first generation unit 44 is configured to generate, based on the keyword, a question from the processed corpus text.
In one embodiment, the device 40 further includes a first creating unit, configured to:
Obtain original corpus text from a specific website;
Filter the original corpus text based on preset rules to obtain valid corpus text;
Establish the corpus using the obtained valid corpus text.
In one embodiment, the first creating unit is specifically configured to:
Screen the original corpus text according to a preset corpus integrity rule to obtain screened corpus text;
Perform character recognition on the screened corpus text and filter the screened corpus text to obtain the valid corpus text.
In one embodiment, the device 40 further includes a second creating unit, configured to:
Segment the corpus text in the corpus by a word segmentation algorithm to obtain a segmentation result;
Establish the first word segmentation dictionary using the obtained segmentation result in combination with a preset word segmentation dictionary.
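How the segmentation result is combined with the preset dictionary is not spelled out in the text. One plausible reading, sketched below purely as an assumption, is to add segments that occur often enough in the corpus to the preset dictionary. The frequency threshold, the minimum segment length and the function name are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_first_dictionary(segmented_texts: Iterable[List[str]],
                           preset_dictionary: Iterable[str],
                           min_count: int = 2) -> Set[str]:
    """Merge frequent corpus segments into the preset segmentation dictionary."""
    counts = Counter(token for tokens in segmented_texts for token in tokens)
    learned = {token for token, n in counts.items()
               if n >= min_count and len(token) > 1}
    return set(preset_dictionary) | learned


segmented = [
    ["端午", "粽子", "龙舟"],
    ["粽子", "龙舟", "月"],
    ["端午", "月饼"],
]
first_dictionary = build_first_dictionary(segmented, preset_dictionary={"春节"})
print(sorted(first_dictionary))
```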
In one embodiment, the device 40 further includes a second generation unit, configured to:
Save the question to the question database when it is determined that the question database does not already store the question.
In one embodiment, the second generation unit is specifically configured to:
Perform data cleaning on the generated question and save the cleaned question to the question database.
In practical application, the extraction unit 41, the word segmentation processing unit 42, the screening unit 43, the first generation unit 44, the first creating unit, the second creating unit and the second generation unit can be implemented by a processor in the data processing device.
It should be noted that, when the data processing device provided by the above embodiment performs data processing, the division into the program modules described above is only an example. In practical application, the processing may be assigned to different program modules as needed; that is, the internal structure of the device may be divided into different program modules to complete all or part of the processing described above. In addition, the data processing device provided by the above embodiment and the data processing method embodiments belong to the same concept; for the specific implementation process, see the method embodiments, which is not repeated here.
Based on the hardware implementation of the above program modules, and in order to implement the method of the embodiment of the present invention, the embodiment of the present invention further provides a data processing device. As shown in Fig. 5, the device 50 includes a processor 51 and a memory 52 configured to store a computer program capable of running on the processor, wherein:
The processor 51 is configured to execute, when running the computer program, the method provided by one or more of the above technical solutions.
In practical application, as shown in Fig. 5, the components of the device 50 are coupled together through a bus system 53. It can be understood that the bus system 53 is used to implement connection and communication among these components. In addition to a data bus, the bus system 53 includes a power bus, a control bus and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 53 in Fig. 5.
In an exemplary embodiment, the embodiment of the present invention further provides a storage medium, which is a computer-readable storage medium, for example the memory 52 containing a computer program. The computer program can be executed by the processor 51 of the data processing device 50 to complete the steps of the foregoing method. The computer-readable storage medium may be a memory such as a ferromagnetic random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM).
It should be noted that terms such as "first" and "second" are used to distinguish similar objects, not to describe a particular order or sequence.
In addition, the technical solutions described in the embodiments of the present invention may be combined arbitrarily provided that they do not conflict.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A data processing method, characterized in that the method comprises:
Extracting corpus text from a corpus;
Performing word segmentation on the corpus text according to a first word segmentation dictionary to obtain processed corpus text;
Screening out a keyword based on the processed corpus text, wherein the keyword is the segment in the corpus text that serves as the correct answer and is mapped to a stem;
Generating, based on the keyword, a question from the processed corpus text.
2. The method according to claim 1, characterized in that, before the extracting of corpus text from the corpus, the method further comprises:
Obtaining original corpus text from a specific website;
Filtering the original corpus text based on preset rules to obtain valid corpus text;
Establishing the corpus using the obtained valid corpus text.
3. The method according to claim 2, characterized in that the filtering of the original corpus text based on preset rules to obtain valid corpus text comprises:
Screening the original corpus text according to a preset corpus integrity rule to obtain screened corpus text;
Performing character recognition on the screened corpus text to obtain the valid corpus text.
4. The method according to claim 1, characterized in that the method further comprises:
Segmenting the corpus text in the corpus by a word segmentation algorithm to obtain a segmentation result;
Establishing the first word segmentation dictionary using the obtained segmentation result in combination with a preset word segmentation dictionary.
5. The method according to claim 1, characterized in that the method further comprises:
Saving the question to a question database when it is determined that the question database does not already store the question.
6. A data processing device, characterized by comprising:
An extraction unit, configured to extract corpus text from a corpus;
A word segmentation processing unit, configured to perform word segmentation on the corpus text according to a first word segmentation dictionary to obtain processed corpus text;
A screening unit, configured to screen out a keyword based on the processed corpus text, wherein the keyword is the segment in the corpus text that serves as the correct answer and is mapped to a stem;
A first generation unit, configured to generate, based on the keyword, a question from the processed corpus text.
7. The device according to claim 6, characterized in that the device further comprises a first creating unit, configured to:
Obtain original corpus text from a specific website;
Filter the original corpus text based on preset rules to obtain valid corpus text;
Establish the corpus using the obtained valid corpus text.
8. The device according to claim 6, characterized in that the device further comprises a second creating unit, configured to:
Segment the corpus text in the corpus by a word segmentation algorithm to obtain a segmentation result;
Establish the first word segmentation dictionary using the obtained segmentation result in combination with a preset word segmentation dictionary.
9. A data processing device, characterized by comprising: a processor and a memory configured to store a computer program capable of running on the processor;
Wherein the processor is configured to perform the steps of the method of any one of claims 1 to 5 when running the computer program.
10. A storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811634468.0A CN109753656A (en) | 2018-12-29 | 2018-12-29 | Data processing method, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109753656A true CN109753656A (en) | 2019-05-14 |
Family
ID=66404351
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811634468.0A Pending CN109753656A (en) | 2018-12-29 | 2018-12-29 | Data processing method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109753656A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103136302A (en) * | 2011-12-05 | 2013-06-05 | 北大方正集团有限公司 | Method and device of test question repeat output |
JP2015215681A (en) * | 2014-05-08 | 2015-12-03 | 日本放送協会 | Keyword extraction device and program |
CN106409041A (en) * | 2016-11-22 | 2017-02-15 | 深圳市鹰硕技术有限公司 | Generation method and system for gap filling test question and grading method and system for gap filling test paper |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175332A (en) * | 2019-06-03 | 2019-08-27 | 山东浪潮人工智能研究院有限公司 | A kind of intelligence based on artificial neural network is set a question method and system |
CN112800200A (en) * | 2021-01-26 | 2021-05-14 | 广州欢网科技有限责任公司 | Program title compiling method, device and equipment |
CN113177117A (en) * | 2021-03-18 | 2021-07-27 | 深圳市北科瑞讯信息技术有限公司 | News material acquisition method and device, storage medium and electronic device |
CN113221558A (en) * | 2021-05-28 | 2021-08-06 | 中邮信息科技(北京)有限公司 | Express delivery address error correction method and device, storage medium and electronic equipment |
CN113221558B (en) * | 2021-05-28 | 2023-09-19 | 中邮信息科技(北京)有限公司 | Express address error correction method and device, storage medium and electronic equipment |
CN113505195A (en) * | 2021-06-24 | 2021-10-15 | 作业帮教育科技(北京)有限公司 | Knowledge base, construction method and retrieval method thereof, and question setting method and system based on knowledge base |
CN113627137A (en) * | 2021-10-11 | 2021-11-09 | 江西软云科技股份有限公司 | Question generation method, question generation system, storage medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109753656A (en) | Data processing method, device and storage medium | |
CN106055541B (en) | A kind of news content filtering sensitive words method and system | |
CN102591854B (en) | For advertisement filtering system and the filter method thereof of text feature | |
CN104050224B (en) | Combining different type coercion components for deferred type evaluation | |
CN109885824A (en) | A kind of Chinese name entity recognition method, device and the readable storage medium storing program for executing of level | |
CN112749284B (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN104966031A (en) | Method for identifying permission-irrelevant private data in Android application program | |
CN108228571B (en) | Method and device for generating couplet, storage medium and terminal equipment | |
CN105740227A (en) | Genetic simulated annealing method for solving new words in Chinese segmentation | |
CN109408811A (en) | A kind of data processing method and server | |
CN105468744A (en) | Big data platform for realizing tax public opinion analysis and full text retrieval | |
CN113434685B (en) | Information classification processing method and system | |
CN110189751A (en) | Method of speech processing and equipment | |
CN106843941A (en) | Information processing method, device and computer equipment | |
CN105095436A (en) | Automatic modeling method for data of data sources | |
CN107679075A (en) | Method for monitoring network and equipment | |
CN112948664A (en) | Method and system for automatically processing sensitive words | |
Sun et al. | Design and Application of an AI‐Based Text Content Moderation System | |
CN117520503A (en) | Financial customer service dialogue generation method, device, equipment and medium based on LLM model | |
JP2010277409A (en) | Representative sentence extracting device and program | |
CN109816038A (en) | A kind of Internet of Things firmware program classification method and its device | |
CN109672586A (en) | A kind of DPI service traffics recognition methods, device and computer readable storage medium | |
CN111736804B (en) | Method and device for identifying App key function based on user comment | |
CN106708922A (en) | Character relation atlas analysis method based on mass data | |
CN111008285B (en) | Author disambiguation method based on thesis key attribute network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||