CN109753656A - Data processing method, device and storage medium - Google Patents

Info

Publication number
CN109753656A
Authority
CN
China
Prior art keywords
corpus
text
corpus text
topic
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811634468.0A
Other languages
Chinese (zh)
Inventor
杨振
赵婷婷
丁昊
李鹤
陈明飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
MIGU Interactive Entertainment Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
MIGU Interactive Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd, MIGU Interactive Entertainment Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN201811634468.0A priority Critical patent/CN109753656A/en
Publication of CN109753656A publication Critical patent/CN109753656A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

Embodiments of the invention provide a data processing method, a data processing device and a storage medium. The method comprises: extracting a corpus text from a corpus; performing word segmentation on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text; screening out a keyword based on the processed corpus text, where the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to a question stem; and generating a question from the processed corpus text based on the keyword.

Description

A data processing method, device and storage medium
Technical field
The present invention relates to the field of data processing, and in particular to a data processing method, device and storage medium.
Background
With the development of the mobile internet, quiz-type applications are increasingly popular, and such applications need a scientific and objective question bank as their foundation. In the related art, the questions in the question banks of various systems and platforms are mainly compiled by hand, which requires manually matching each question stem with its answer. On the one hand, stems and answers are easily mismatched and answers are easily omitted; on the other hand, manual matching is extremely inefficient.
Therefore, a technical solution that can generate questions automatically is urgently needed.
Summary of the invention
In view of this, embodiments of the present invention provide a data processing method, device and storage medium to generate questions automatically.
An embodiment of the present invention provides a data processing method, comprising:
extracting a corpus text from a corpus;
performing word segmentation on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text;
screening out a keyword based on the processed corpus text, where the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to a question stem;
generating a question from the processed corpus text based on the keyword.
In the above solution, before the corpus text is extracted from the corpus, the method further comprises:
obtaining original corpus texts from specific websites;
filtering the original corpus texts based on preset rules to obtain valid corpus texts;
establishing the corpus from the valid corpus texts obtained.
In the above solution, filtering the original corpus texts based on preset rules to obtain valid corpus texts comprises:
screening the original corpus texts according to a preset corpus integrity rule to obtain screened corpus texts;
performing character recognition on the screened corpus texts to obtain the valid corpus texts.
In the above solution, the method further comprises:
segmenting the corpus texts in the corpus with a word segmentation algorithm to obtain a segmentation result;
establishing the first word segmentation dictionary from the segmentation result in combination with a preset word segmentation dictionary.
In the above solution, the method further comprises:
when it is determined that the question is not yet stored in a question bank, saving the question to the question bank.
An embodiment of the present invention further provides a data processing device, comprising:
an extraction unit, configured to extract a corpus text from a corpus;
a word segmentation unit, configured to perform word segmentation on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text;
a screening unit, configured to screen out a keyword based on the processed corpus text, where the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to a question stem;
a first generation unit, configured to generate a question from the processed corpus text based on the keyword.
In the above solution, the device further comprises a first creation unit, configured to:
obtain original corpus texts from specific websites;
filter the original corpus texts based on preset rules to obtain valid corpus texts;
establish the corpus from the valid corpus texts obtained.
In the above solution, the device further comprises a second creation unit, configured to:
segment the corpus texts in the corpus with a word segmentation algorithm to obtain a segmentation result;
establish the first word segmentation dictionary from the segmentation result in combination with a preset word segmentation dictionary.
An embodiment of the present invention further provides a data processing device, comprising a processor and a memory configured to store a computer program capable of running on the processor;
where the processor is configured to perform the steps of any of the above methods when running the computer program.
An embodiment of the present invention further provides a storage medium storing a computer program which, when executed by a processor, implements the steps of any of the above methods.
With the data processing method, device and storage medium provided by the embodiments of the present invention, a corpus text is extracted from a corpus; word segmentation is performed on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text; a keyword is screened out based on the processed corpus text, where the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to a question stem; and a question is generated from the processed corpus text based on the keyword. In the embodiments, the screened keyword is used to establish the mapping between stem and answer automatically, and this mapping is accurate and reliable, so the probability of stem-answer mismatches and of omitted answers is effectively reduced. The method also generates questions automatically and, per unit of time, is clearly more efficient than manual compilation, which saves time and labor costs.
Brief description of the drawings
Fig. 1 is a first schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 2 is a second schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a question generation method according to an application embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data processing device according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the hardware structure of a data processing device according to an embodiment of the present invention.
Detailed description
To provide a fuller understanding of the features and technical content of the embodiments of the present invention, their implementation is described in detail below with reference to the accompanying drawings. The drawings are for reference and illustration only and are not intended to limit the invention.
In the related art, systems and platforms build their question banks mainly by compiling questions by hand. This approach has the following main drawbacks:
1. Although many platforms and applications can generate questions at random, they essentially pick them at random from an existing question bank, and that bank is still mainly compiled by hand;
2. The quality of a question bank depends closely on the professional level of the compilers and on the material they obtain, so the scientific rigor and objectivity of the questions cannot be guaranteed;
3. Manual compilation is affected by subjective factors, has a high rate of errors and omissions, and lacks uniform standards, which places a heavy burden on later review and consumes considerable manpower, material and financial resources;
4. Building a question bank involves many stages, such as data collection, collation, compilation, classification, review and storage, and every hand-off between stages can cause data loss, data contamination or tampering;
5. The degree of customization is low: for a new requirement, the staff need time to learn, and the whole process from data collection to storage has to be repeated, which is inefficient;
6. Question bank generation cannot run continuously, which limits growth in the size and quality of the bank.
On this basis, in the embodiments of the present invention, questions are generated automatically from massive data and a question bank is built from them.
The data processing method provided by the embodiments of the present invention can generate questions automatically and, in turn, generate a question bank automatically.
The questions referred to here are objective question types, which may include multiple-choice, true/false and fill-in-the-blank questions; multiple-choice questions may have more than one correct option.
An embodiment of the present invention provides a data processing method. Fig. 1 is a schematic flowchart of the method. As shown in Fig. 1, the method comprises the following steps:
Step S101: extract a corpus text from a corpus.
Here, in practice, the data stored in the corpus is language material that has actually occurred in real language use. A corpus is a basic resource that carries linguistic knowledge, with the computer as its carrier. Corpus texts must be analyzed and processed before they become useful resources.
Here, the extracted corpus text may be drawn at random from the corpus.
Step S102: perform word segmentation on the corpus text according to the first word segmentation dictionary to obtain a processed corpus text.
Here, the first word segmentation dictionary is different from a preset word segmentation dictionary.
The difference between the first word segmentation dictionary and a preset word segmentation dictionary is illustrated below with a concrete example.
Suppose the corpus text "imitate Li Xiaopeng jump" is crawled from an official website. With a preset word segmentation dictionary, this text is usually segmented as: imitate / Li Xiaopeng / jump. In the embodiments of the present invention, by contrast, the corpus texts in the corpus are segmented with several segmentation algorithms, and the first word segmentation dictionary is built from the segmentation result in combination with the preset dictionary. Consequently, after segmentation with the first word segmentation dictionary, the text is more likely to be split as: imitate / Li Xiaopeng jump, with "Li Xiaopeng jump" kept together as a single term.
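The effect of the dictionary on the split can be illustrated with a minimal sketch of forward maximum matching. The function name `forward_max_match`, the `max_len` parameter and both toy dictionaries are illustrative assumptions, and the Chinese string is an assumed original-language form of the example phrase; the patent does not give this code.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionaries: the "first" dictionary adds one term learned from the corpus.
preset_dictionary = {"模仿", "李小鹏"}
first_dictionary = preset_dictionary | {"李小鹏跳"}

print(forward_max_match("模仿李小鹏跳", preset_dictionary))
print(forward_max_match("模仿李小鹏跳", first_dictionary))
```

With the preset dictionary the phrase falls apart into three tokens; once the corpus-derived term is in the dictionary, it is kept whole.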
This completes the word segmentation of the corpus text and yields the segmented corpus text.
Step S103: screen out a keyword based on the processed corpus text, where the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to a question stem.
Here, the purpose of this step is to screen out, from the corpus text, the segmented word that will serve as the correct answer.
In implementation, the processed corpus text can be screened according to predefined question-setting rules to obtain the keyword. A question-setting rule may define a fixed collocation pattern: for example, in the collocation "using ... as", the segmented words in the position of the ellipsis are screened out as the keyword. A rule may also define specific content that the text must contain: for example, segmented words containing a number are screened out as the keyword. The question-setting rules can be chosen flexibly according to the application and are not specifically limited here.
Step S104: generate a question from the processed corpus text based on the keyword.
Here, because different question types have different structures, the question is generated from the processed corpus text according to a different extraction rule for each type.
For example, if the question to be generated is a fill-in-the-blank question, the keyword is replaced with a designated symbol (such as brackets) to produce the stem, and the keyword is mapped to the stem as the answer. The stem, the answer, and the mapping between them constitute the question.
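The fill-in-the-blank case can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Question` record, the `make_fill_in_blank` helper, the bracket placeholder and the sample sentence are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Question:
    stem: str    # sentence with the keyword blanked out
    answer: str  # the keyword, mapped to the stem as the correct answer


def make_fill_in_blank(sentence, keyword, blank="(    )"):
    """Replace the first occurrence of the keyword with a placeholder to form the stem."""
    if keyword not in sentence:
        raise ValueError("keyword does not occur in the sentence")
    return Question(stem=sentence.replace(keyword, blank, 1), answer=keyword)


question = make_fill_in_blank(
    "In 2006, the State Council listed the Dragon Boat Festival.", "2006"
)
print(question.stem)
print(question.answer)
```

Keeping the stem and the answer in one record is one simple way to hold the mapping between them.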
If the question to be generated is a multiple-choice question, then in addition to replacing the keyword with a designated symbol (such as brackets), alternative wrong options must also be generated. Wrong options may be generated by applying a certain offset to the correct answer (the keyword) or by substituting similar words; this is not specifically limited here. The stem, the multiple options, and the mapping between the stem and the options constitute the question.
If the question to be generated is a true/false question, two cases arise. For a question whose answer is "true", no keyword replacement is needed, and the stem is taken directly from the processed corpus text. For a question whose answer is "false", a wrong answer must be generated and substituted for the keyword to obtain the stem; the wrong answer may be generated by applying a certain offset to the correct answer (the keyword) or by substituting a similar word, which is not specifically limited here. The number of keyword replacements is recorded and determines the answer: if the number of replacements is 0, the answer is "true"; otherwise, the answer is "false". The stem, the answer, and the mapping between them constitute the question.
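The replacement-count rule for true/false questions can be sketched as below. The helper name `make_true_false`, its parameters and the dictionary layout are assumptions made for illustration only.

```python
def make_true_false(sentence, keyword, wrong_value=None):
    """Build a true/false item; the answer is derived from the number of replacements."""
    stem, replacements = sentence, 0
    if wrong_value is not None and keyword in sentence:
        # Substitute a wrong value for the keyword to obtain a false statement.
        stem = sentence.replace(keyword, wrong_value, 1)
        replacements = 1
    # Zero replacements means the stem is the original statement, so the answer is True.
    return {"stem": stem, "answer": replacements == 0, "replacements": replacements}


statement = "In 2006, the State Council listed the Dragon Boat Festival."
print(make_true_false(statement, "2006"))
print(make_true_false(statement, "2006", wrong_value="2007"))
```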
In this way, questions of different types are generated from the corpus text using different extraction rules.
With the data processing method provided by this embodiment of the present invention, a corpus text is extracted from a corpus; word segmentation is performed on the corpus text according to a first word segmentation dictionary to obtain a processed corpus text; a keyword is screened out based on the processed corpus text, where the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to a question stem; and a question is generated from the processed corpus text based on the keyword. The screened keyword is used to establish the mapping between stem and answer automatically, and this mapping is accurate and reliable, so the probability of stem-answer mismatches and of omitted answers is effectively reduced. Because the method also generates questions automatically, it is, per unit of time, clearly more efficient than manual compilation, which saves time and labor costs.
An embodiment of the present invention provides another data processing method. Fig. 2 is a schematic flowchart of this method. As shown in Fig. 2, the method comprises the following steps:
Step S201: establish a corpus.
Here, the corpus is the basis for building the first word segmentation dictionary and for generating questions. The corpus can be established through the following steps:
Step a: obtain original corpus texts from specific websites.
Here, because the embodiments of the present invention deal with massive data, the data sources are restricted to specific media in order to ensure the reliability and standardization of the data. These mainly include: online news, blogs, papers, reports, tutorials and literary works published on official and authoritative website platforms of various industries and professional fields; laws, regulations, cases and other material published on the online platforms of national official and authoritative bodies; and offline electronic documents in various formats.
The original corpus texts can be obtained from the specific websites by a crawler program.
Step b: filter the original corpus texts based on preset rules to obtain valid corpus texts.
Here, the preset rules comprise a preset corpus integrity rule and character recognition. The original corpus texts are first screened according to the preset corpus integrity rule; character recognition is then performed on the screened texts, and texts containing unrecognizable characters are filtered out, leaving the valid corpus texts.
Here, the preset corpus integrity rule refers to integrity constraints, set in advance, that a corpus text must satisfy, for example that no primary attribute in a relation may be null; the constraints can be adjusted to actual needs.
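The two-stage filter in step b can be sketched as follows. The record layout (`source`, `title`, `body`), the choice of required fields and the definition of an "unrecognizable" character are all assumptions made for illustration; the patent leaves these details open.

```python
import unicodedata

REQUIRED_FIELDS = ("source", "title", "body")  # hypothetical integrity rule: none may be empty


def is_complete(record):
    """Integrity check: every required field is present and non-empty."""
    return all((record.get(field) or "").strip() for field in REQUIRED_FIELDS)


def has_unrecognizable_chars(text):
    """Flag the Unicode replacement character, control, unassigned or private-use code points."""
    for ch in text:
        if ch == "\ufffd":
            return True
        if unicodedata.category(ch) in ("Cc", "Cn", "Co") and ch not in "\n\r\t":
            return True
    return False


def filter_corpus(records):
    """Keep only records that pass the integrity rule and contain no unrecognizable characters."""
    return [r for r in records if is_complete(r) and not has_unrecognizable_chars(r["body"])]


raw = [
    {"source": "official-site", "title": "Festival", "body": "Clean text."},
    {"source": "official-site", "title": "", "body": "Missing title."},
    {"source": "official-site", "title": "Broken", "body": "Bad \ufffd character."},
]
print(len(filter_corpus(raw)))
```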
Step c: establish the corpus from the valid corpus texts obtained.
The valid corpus texts are saved to the corpus.
This completes the establishment of the corpus.
In practice, new valid corpus texts are continually saved to the corpus, so the corpus is continuously updated and enriched.
Step S202: establish the first word segmentation dictionary.
Here, the first word segmentation dictionary is the basis on which the embodiments of the present invention perform word segmentation, and its richness directly affects segmentation accuracy. It can be established through the following steps:
Step a: segment the corpus texts in the corpus with a word segmentation algorithm to obtain a segmentation result.
The segmentation algorithms here include string-matching-based, understanding-based and statistics-based algorithms, which can be used alone or in combination.
For example, the corpus texts may first be preliminarily segmented with the reverse maximum matching method (RMM) and the maximum matching method (MM), both of which are string-matching-based algorithms. The probability with which words occur and the probability with which they co-occur in the corpus texts can then be counted (a statistics-based method), and the segmentation result is obtained from these statistics.
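A minimal sketch of this combination is given below: reverse maximum matching produces a preliminary split, and word and adjacent-pair counts provide the raw statistics. The function names, the `max_len` limit and the toy dictionary are assumptions, and the sample sentence is a common textbook phrase rather than one taken from the patent.

```python
from collections import Counter


def reverse_max_match(text, dictionary, max_len=4):
    """Reverse maximum matching: scan from the end, taking the longest dictionary word."""
    tokens, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected right to left
    return tokens


def count_words_and_pairs(segmented_texts):
    """Count single words and adjacent word pairs across all segmented texts."""
    words, pairs = Counter(), Counter()
    for tokens in segmented_texts:
        words.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))
    return words, pairs


toy_dictionary = {"研究", "研究生", "生命", "起源"}
tokens = reverse_max_match("研究生命的起源", toy_dictionary)
words, pairs = count_words_and_pairs([tokens])
print(tokens)
print(pairs.most_common(1))
```

Dividing these counts by their totals gives relative frequencies that a statistics-based step could use to choose between competing splits.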
Step b: establish the first word segmentation dictionary from the segmentation result in combination with a preset word segmentation dictionary.
A preset word segmentation dictionary here is one that has already been produced with existing segmentation algorithms and can be called by users at any time. Mainstream examples at present include jieba (166,000 words), IK (275,000 words), mmseg (150,000 words) and word (275,000 words). These preset dictionaries are generally integrated into different programming language environments as functional components. Combining a preset dictionary both enriches the dictionary and reduces the computational load of creating the first word segmentation dictionary.
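One simple way to combine the two sources is to take the union of the preset vocabulary and the multi-character terms that recur in the corpus segmentation. The sketch below assumes this merging strategy; the function name, the `min_count` threshold and the sample data are hypothetical.

```python
from collections import Counter


def build_first_dictionary(segmented_texts, preset_words, min_count=2):
    """Merge preset words with multi-character terms seen at least min_count times."""
    counts = Counter(token for tokens in segmented_texts for token in tokens)
    learned = {word for word, n in counts.items() if n >= min_count and len(word) > 1}
    return set(preset_words) | learned


segmented = [["模仿", "李小鹏跳"], ["李小鹏跳", "难度"]]
first_dictionary = build_first_dictionary(segmented, {"模仿", "李小鹏"})
print(sorted(first_dictionary))
```

Starting from the preset set means only the corpus-specific terms have to be learned, which matches the stated aim of reducing the work of building the dictionary.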
This completes the establishment of the first word segmentation dictionary.
In practice, after the first word segmentation dictionary is established, newly obtained segmented words can be added to it, so the dictionary is continuously updated and enriched.
Because new words keep emerging and are used frequently for a period after they appear, new words can be specially marked to increase their weight, which improves the efficiency of subsequent word segmentation of corpus texts with the first word segmentation dictionary.
Step S203: extract a corpus text from the corpus.
Here, in practice, the data stored in the corpus is language material that has actually occurred in real language use. A corpus is a basic resource that carries linguistic knowledge, with the computer as its carrier. Corpus texts must be analyzed and processed before they become useful resources.
Here, extracting the corpus text from a well-established corpus improves the quality of the text, which facilitates the subsequent generation of questions and the establishment of the question bank.
For example, suppose the following corpus text is extracted from the corpus: "The Dragon Boat Festival, together with the Spring Festival, the Qingming Festival and the Mid-Autumn Festival, is known as one of the four great traditional Chinese folk festivals. Dragon Boat Festival culture has wide influence in the world, and some countries and regions also have the custom of celebrating it. In 2006, the State Council included the Dragon Boat Festival in the first batch of the National Intangible Cultural Heritage List. The origin of the Dragon Boat Festival covers ancient astrological culture, humanistic philosophy and other content, and carries profound and rich cultural connotations. In folk culture, the Chinese people mark the festival using eating zongzi and dragon boat racing as its two major traditional custom themes."
Step S204: perform word segmentation on the corpus text according to the first word segmentation dictionary to obtain a processed corpus text.
Here, the first word segmentation dictionary is different from a preset word segmentation dictionary.
The difference between the first word segmentation dictionary and a preset word segmentation dictionary is illustrated below with a concrete example.
Suppose the corpus text "imitate Li Xiaopeng jump" is crawled from an official website. With a preset word segmentation dictionary, this text is usually segmented as: imitate / Li Xiaopeng / jump. In the embodiments of the present invention, by contrast, the corpus texts in the corpus are segmented with several segmentation algorithms, and the first word segmentation dictionary is built from the segmentation result in combination with the preset dictionary. Consequently, after segmentation with the first word segmentation dictionary, the text is more likely to be split as: imitate / Li Xiaopeng jump, with "Li Xiaopeng jump" kept together as a single term.
In a specific implementation, the text can be broken into several short text segments with a 2-gram segmentation instruction in Python (a computer programming language).
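The patent does not say which Python instruction is meant. As a sketch under that caveat, overlapping 2-grams can be produced with plain string slicing; the helper name `char_ngrams` is an assumption.

```python
def char_ngrams(text, n=2):
    """Return all overlapping n-character slices of text (2-grams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("端午节"))
```

Each 2-gram is a candidate segment that can then be checked against the first word segmentation dictionary or counted for statistics.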
This completes the word segmentation of the corpus text and yields the segmented corpus text.
Step S205: screen out a keyword based on the processed corpus text, where the keyword is a segmented word in the corpus text that serves as the correct answer and is mapped to a question stem.
Here, the purpose of this step is to screen out, from the corpus text, the segmented word that will serve as the correct answer.
In implementation, the processed corpus text can be screened according to predefined question-setting rules to obtain the keyword. A question-setting rule may define a fixed collocation pattern: for example, in the collocation "using ... as", the segmented words in the position of the ellipsis are screened out as the keyword. A rule may also define specific content that the text must contain: for example, segmented words containing a number are screened out as the keyword. The question-setting rules can be chosen flexibly according to the application and are not specifically limited here.
The keyword screening process is described in more detail below by way of example.
Taking the above example, the extracted corpus text is: "The Dragon Boat Festival, together with the Spring Festival, the Qingming Festival and the Mid-Autumn Festival, is known as one of the four great traditional Chinese folk festivals. Dragon Boat Festival culture has wide influence in the world, and some countries and regions also have the custom of celebrating it. In 2006, the State Council included the Dragon Boat Festival in the first batch of the National Intangible Cultural Heritage List. The origin of the Dragon Boat Festival covers ancient astrological culture, humanistic philosophy and other content, and carries profound and rich cultural connotations. In folk culture, the Chinese people mark the festival using eating zongzi and dragon boat racing as its two major traditional custom themes." The following operations are performed on this text:
1. If the question-setting rule defines specific content that the text must contain (one of a number, a date or time, a place, and so on), the keyword "2006" can be screened out.
2. If the question-setting rule defines a fixed collocation pattern (one of the "using ... as" pattern, the quotation-mark pattern, the "...; ...; ..." pattern, and so on), the keyword "eating zongzi, dragon boat racing" can be screened out.
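The two rules above can be sketched over an already segmented text. The function names and the English token lists are illustrative assumptions; a real implementation would operate on the Chinese segmentation output and on whichever rules the application defines.

```python
def screen_numeric(tokens):
    """Rule 1: keep tokens that contain at least one digit."""
    return [token for token in tokens if any(ch.isdigit() for ch in token)]


def screen_collocation(tokens, left="using", right="as"):
    """Rule 2: keep the tokens found between a fixed pair of collocation markers."""
    spans, start = [], None
    for i, token in enumerate(tokens):
        if token == left and start is None:
            start = i + 1                      # keyword span begins after the left marker
        elif token == right and start is not None:
            if i > start:
                spans.append(tokens[start:i])  # tokens strictly between the markers
            start = None
    return spans


sentence_a = ["In", "2006", ",", "the", "State Council", "listed", "the", "festival"]
sentence_b = ["people", "mark", "the", "festival", "using",
              "eating zongzi", "dragon boat racing", "as", "themes"]
print(screen_numeric(sentence_a))
print(screen_collocation(sentence_b))
```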
Step S206: generate a question from the processed corpus text based on the keyword.
Here, because different question types have different structures, the question is generated from the processed corpus text according to a different extraction rule for each type.
For example, if the question to be generated is a fill-in-the-blank question, the keyword is replaced with a designated symbol (such as brackets) to produce the stem, and the keyword is mapped to the stem as the answer. The stem, the answer, and the mapping between them constitute the question.
If the question to be generated is a multiple-choice question, then in addition to replacing the keyword with a designated symbol (such as brackets), alternative wrong options must also be generated. Wrong options may be generated by applying a certain offset to the correct answer (the keyword) or by substituting similar words; this is not specifically limited here. The stem, the multiple options, and the mapping between the stem and the options constitute the question.
The generation of alternative wrong options is described in more detail below by way of example. Taking the above example, the following operations are performed:
1. If the keyword is numeric, a certain offset is applied according to the numeric pattern of the correct answer, and the set number of alternatives is generated. In the example, "2006" is mapped to the stem as the answer and marked as the correct option, while "2005", "2007" and "2016" are generated as alternative wrong options; the correct option and the wrong options are all mapped to the stem.
2. If the keyword has no obvious pattern, such as "eating zongzi, dragon boat racing", other options can be generated by searching the corpus for phrases that match the keyword's pattern. For example, similar phrases such as "eating rice dumplings, dragon boat racing" and "eating zongzi, admiring the full moon" may be retrieved from other corpus texts; after verification they are used as alternative wrong options and, together with the correct option, mapped to the stem.
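Both ways of producing wrong options can be sketched as follows. The offsets are chosen to reproduce the example values above. For phrases without a numeric pattern, the sketch substitutes character-level similarity from the standard library's `difflib` for the patent's unspecified pattern search, so it is a stand-in rather than the described method; the function names, `offsets`, `n` and `cutoff` are assumptions, and the results would still need the verification step mentioned above.

```python
import difflib


def numeric_distractors(answer, offsets=(-1, 1, 10)):
    """Offset a numeric answer to obtain wrong options, skipping duplicates and the answer."""
    value = int(answer)
    options = []
    for offset in offsets:
        candidate = str(value + offset)
        if candidate != answer and candidate not in options:
            options.append(candidate)
    return options


def similar_phrase_distractors(answer, corpus_phrases, n=2, cutoff=0.5):
    """Pick up to n corpus phrases that look similar to the answer but differ from it."""
    pool = [p for p in dict.fromkeys(corpus_phrases) if p != answer]
    return difflib.get_close_matches(answer, pool, n=n, cutoff=cutoff)


print(numeric_distractors("2006"))
print(similar_phrase_distractors(
    "eating zongzi, dragon boat racing",
    ["eating zongzi, admiring the full moon",
     "eating zongzi, dragon boat racing",
     "eating rice dumplings, dragon boat racing"],
))
```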
If the question to be generated is a true/false question, two cases arise. For a question whose answer is "true", no keyword replacement is needed, and the stem is taken directly from the processed corpus text. For a question whose answer is "false", a wrong answer must be generated and substituted for the keyword to obtain the stem; the wrong answer may be generated by applying a certain offset to the correct answer (the keyword) or by substituting a similar word, which is not specifically limited here. The number of keyword replacements is recorded and determines the answer: if the number of replacements is 0, the answer is "true"; otherwise, the answer is "false". The stem, the answer, and the mapping between them constitute the question.
In this way, questions of different types are generated from the corpus text using different extraction rules.
Step S207: generate a question bank based on the generated questions.
When it is determined that a question is not yet stored in the question bank, the question is saved to the question bank.
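The save-if-absent check can be sketched with an in-memory store keyed on the stem and answer. The `QuestionBank` class, its method names and the choice of key are assumptions; a production system would more likely enforce the same rule with a uniqueness constraint in a database.

```python
class QuestionBank:
    """Minimal in-memory question bank that refuses to store a question twice."""

    def __init__(self):
        self._items = {}

    def save_if_new(self, stem, answer):
        """Store the question only if an identical stem and answer pair is not present."""
        key = (stem.strip(), answer.strip())
        if key in self._items:
            return False
        self._items[key] = {"stem": stem, "answer": answer}
        return True

    def __len__(self):
        return len(self._items)


bank = QuestionBank()
print(bank.save_if_new("In (    ), the festival was listed.", "2006"))
print(bank.save_if_new("In (    ), the festival was listed.", "2006"))
print(len(bank))
```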
In one embodiment, the generated questions also undergo processing such as data cleaning, and only the processed questions are saved to the question database; the question database can also be corrected and improved according to feedback from third-party applications and platforms.
Here, in practical application, the data cleaning may include:
Performing a semantic audit on the generated questions, and performing integrity and correctness checks on the questions that pass the semantic audit.
The data processing method provided by the embodiment of the present invention first establishes a corpus and a first word segmentation dictionary, and then extracts corpus text from the corpus; performs word segmentation on the corpus text according to the first word segmentation dictionary to obtain processed corpus text; screens out keywords based on the processed corpus text, where a keyword is a segment in the corpus text that serves as the correct answer and establishes a mapping relationship with the stem; and generates questions from the processed corpus text based on the keywords. Finally, a question database is generated from the generated questions. In the embodiment of the present invention, the mapping relationship between stem and answer is established automatically through the screened keywords, and this mapping is accurate and reliable, so the probability of mismatches between stem and answer and of missing answers is effectively reduced. The method also generates questions automatically, which gives it a clear efficiency advantage per unit time over manually written questions and saves time and labor costs.
In addition, in the embodiment of the present invention, the data for the established corpus comes from specific websites, such as releases on the official platforms of professional fields. This ensures the reliability and standardization of the data source, so the generated questions and question database are more scientific and authoritative. The established first word segmentation dictionary is a better reference for segmentation than the preset word segmentation dictionary, so word segmentation is more accurate, which in turn makes the generated questions and question database more accurate.
Beyond this, the question database can be corrected according to feedback from third-party applications and platforms. That is, the question database is continuously in operation, so it can be iterated and updated in time and stays current.
The present invention is described in further detail below with reference to an application embodiment.
The application embodiment of the present invention provides a question generation method. Fig. 3 is a schematic flowchart of the question generation method of the application embodiment. As shown in Fig. 3, the method includes the following steps:
Step S301: collect data according to preset rules.
Here, the data is collected from sources published by official platforms and authoritative institutions, to ensure its reliability and standardization, and a crawler program crawls the target data sources and collects the data.
Here, the preset rules are rules prepared in advance in a unified standard format. The rules support basic operations such as adding, deleting, modifying and querying, and can be arranged and combined as required. By setting rules, targeted data can be collected.
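One way to picture rules that can be added, deleted, modified, queried and combined is sketched below. The text does not define the rule format, so representing each rule as a named predicate over a record, and the record fields `source` and `text`, are assumptions.

```python
class RuleSet:
    """Named predicates supporting add/delete/modify/query and combination."""

    def __init__(self):
        self._rules = {}

    def add(self, name, predicate):
        if name in self._rules:
            raise KeyError(f"rule already exists: {name}")
        self._rules[name] = predicate

    def delete(self, name):
        del self._rules[name]

    def modify(self, name, predicate):
        if name not in self._rules:
            raise KeyError(f"unknown rule: {name}")
        self._rules[name] = predicate

    def query(self, name):
        return self._rules.get(name)

    def combine(self, *names):
        """Return a predicate that passes only if every named rule passes."""
        predicates = [self._rules[name] for name in names]
        return lambda record: all(p(record) for p in predicates)


rules = RuleSet()
rules.add("official_source", lambda r: r["source"] == "official")
rules.add("has_text", lambda r: bool(r["text"].strip()))
accept = rules.combine("official_source", "has_text")
```

Because `combine` captures the predicates at call time, later edits to the rule set do not silently change a collector that is already running.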
Step S302: filter the collected data according to preset rules to obtain filtered data.
Here, the collected data is first filtered according to the relevant national laws and regulations and an integrity rule, removing sensitive corpus texts that are unsuitable for formal publication and corpus texts that lack integrity. Further, character recognition processing can be performed on the filtered corpus texts, and corpus texts containing unrecognizable characters are filtered out, so as to obtain legal, usable data.
Here, the filtered data can be used to form the corpus.
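The filtering in step S302 can be sketched as below. The text does not specify the integrity rule or what counts as an unrecognizable character, so the minimum length, the sentence-ending check, the Unicode categories treated as unrecognizable and the sensitive-term list are all assumptions.

```python
import unicodedata


def has_unrecognizable_chars(text):
    """Flag replacement characters, unassigned/private code points and stray controls."""
    for ch in text:
        if ch == "\ufffd":
            return True
        category = unicodedata.category(ch)
        if category in ("Cn", "Co", "Cs"):
            return True
        if category == "Cc" and ch not in "\n\r\t":
            return True
    return False


def is_complete(text, min_len=10):
    """Assumed integrity rule: long enough and ends with sentence punctuation."""
    text = text.strip()
    return len(text) >= min_len and text.endswith((".", "!", "?", "。", "！", "？"))


def filter_corpus(texts, sensitive_terms=()):
    """Keep only texts that pass the sensitivity, integrity and character checks."""
    kept = []
    for text in texts:
        if any(term in text for term in sensitive_terms):
            continue
        if not is_complete(text):
            continue
        if has_unrecognizable_chars(text):
            continue
        kept.append(text)
    return kept
```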
Step S303: perform word segmentation on the filtered data and verify the segmentation result to obtain the segmented data.
Here, in a specific implementation, this can be achieved through the following steps:
Step a: perform word segmentation on the filtered data using a 2-gram routine in Python to obtain a word segmentation result;
Step b: using regular expressions, replace segments in the obtained word segmentation result with the segments that conform to official or authoritative publications, so that the segments used in the data are consistent with those publications.
At this point, the segmented data has been obtained.
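Steps a and b can be sketched as follows. The text does not name the Python 2-gram routine it uses, so a plain character-bigram function stands in for it here, and the pattern-to-official-form mapping is a hypothetical example.

```python
import re


def char_bigrams(text):
    """Split text into overlapping two-character segments, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def normalize_segments(segments, official_forms):
    """Rewrite segments to their official form.

    official_forms maps a regular expression to the replacement string that
    matches official or authoritative usage.
    """
    compiled = [(re.compile(p), repl) for p, repl in official_forms.items()]
    normalized = []
    for segment in segments:
        for pattern, replacement in compiled:
            segment = pattern.sub(replacement, segment)
        normalized.append(segment)
    return normalized


segments = char_bigrams("端午节")
# Hypothetical mapping: treat "龙船" as a variant to be rewritten as "龙舟".
fixed = normalize_segments(["端午", "龙船"], {r"^龙船$": "龙舟"})
```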
Step S304: screen out keywords based on the segmented data, where a keyword is a segment in the corpus text that serves as the correct answer and establishes a mapping relationship with the stem.
Step S305: based on the keywords, generate questions from the processed corpus text in combination with morphological analysis.
Here, the generated question type may be multiple-choice, true/false, fill-in-the-blank, and so on.
Specifically, if the generated question is a fill-in-the-blank question, the keyword is replaced with a designated symbol (such as brackets) to generate the stem, and the keyword is mapped to the stem as the answer. Here, the generated stem, the answer, and the mapping relationship between the stem and the answer constitute the question.
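The fill-in-the-blank case can be sketched as below, assuming the processed corpus text is a list of segments. The function name and the blank marker `( )` are illustrative choices.

```python
def build_fill_in_blank(tokens, keyword, blank="( )"):
    """Replace the keyword with a blank and map the keyword to the stem as answer."""
    if keyword not in tokens:
        raise ValueError("keyword does not occur in the segmented text")
    stem = "".join(blank if token == keyword else token for token in tokens)
    return {"type": "fill_in_blank", "stem": stem, "answer": keyword}


item = build_fill_in_blank(["端午节", "在", "农历", "五月初五"], "五月初五")
```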
If the generated question is a multiple-choice or true/false question, the keyword is replaced with a designated symbol (such as brackets) to generate the stem, and candidate wrong answers must additionally be generated. The candidate wrong answers are generated by morphological analysis, which may include applying a certain offset to the correct option (the keyword) or replacing it with a similar word; this is not specifically limited here. Here, the generated stem, the answers, and the mapping relationship between the stem and the answers constitute the question.
Step S306: filter the generated questions according to preset rules to obtain filtered questions.
Here, the preset rules include a semantic audit, integrity and correctness checks, and duplicate detection. A semantic audit is first performed on the generated questions; the questions that pass the semantic audit then undergo integrity and correctness checks; finally, the questions that pass these checks are compared against the questions already generated, and duplicates are filtered out, yielding the filtered questions.
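The three-stage filter in step S306 can be sketched as below. The semantic audit is not described in the text, so it is passed in as a caller-supplied function; the integrity check (non-empty stem, answer present) and the use of the stem as the duplicate key are assumptions.

```python
def filter_questions(candidates, existing_stems, semantic_ok):
    """Apply semantic audit, integrity check and duplicate detection in order."""
    seen = set(existing_stems)
    kept = []
    for question in candidates:
        # Stage 1: semantic audit, delegated to the caller.
        if not semantic_ok(question):
            continue
        # Stage 2: integrity check (assumed rule).
        if not question.get("stem") or question.get("answer") is None:
            continue
        # Stage 3: duplicate detection against stored and earlier questions.
        if question["stem"] in seen:
            continue
        seen.add(question["stem"])
        kept.append(question)
    return kept
```

Testing `answer is None` rather than truthiness keeps true/false questions whose answer is `False`.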
Step S307: classify the filtered questions and save them to the question database according to the classification result.
Here, the filtered questions are first classified, a classification index is then established for them, and they are saved to the question database according to the classification result. The classification index is established to facilitate subsequent maintenance and upgrading of the question database.
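A classification index can be as simple as a mapping from category to question positions, as sketched below. The text does not say how questions are classified, so the `classify` callback and the use of the question type as the category are assumptions.

```python
from collections import defaultdict


def build_classification_index(questions, classify):
    """Map each category to the positions of the questions that belong to it."""
    index = defaultdict(list)
    for position, question in enumerate(questions):
        index[classify(question)].append(position)
    return dict(index)


questions = [
    {"type": "fill_in_blank", "stem": "A ( ) B"},
    {"type": "true_false", "stem": "C is D."},
    {"type": "fill_in_blank", "stem": "E ( ) F"},
]
index = build_classification_index(questions, lambda q: q["type"])
```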
Step S308: establish a unified external interface for the question database.
Here, a unified external interface is established for the question database, and the external interface is backward compatible. Whenever the question database is published externally, the unified external interface is used, so that third-party applications and platforms can access and call it. The backward compatibility of the external interface ensures that an upgraded question database still supports normal use by third-party applications and platforms that were connected before the upgrade.
Step S309: correct and improve the question database according to feedback from third-party applications and platforms.
According to the feedback from third-party applications and platforms, and provided that the validity of the feedback has been verified, the question database is corrected and improved.
Here, the feedback can be applied at step S306: after the question concerned by the feedback is corrected, the subsequent steps continue.
The data processing method provided by the embodiment of the present invention uses an automated flow for the entire process, from data collection and question generation to classified storage, which effectively avoids problems such as data loss, contamination and tampering caused by manual handling while data circulates. In addition, in the embodiment of the present invention, questions can be customized by setting extraction rules, thereby obtaining the question database, and complex data extraction scenarios can be handled by combining rules. The question database has a unified external interface, which makes it convenient for third-party applications and platforms to access and call.
To implement the method of the embodiment of the present invention, the embodiment of the present invention also provides a data processing device. Fig. 4 is a schematic structural diagram of the device of the embodiment. As shown in Fig. 4, the device 40 includes an extraction unit 41, a word segmentation unit 42, a screening unit 43 and a first generation unit 44, wherein:
The extraction unit 41 is configured to extract corpus text from a corpus;
The word segmentation unit 42 is configured to perform word segmentation on the corpus text according to a first word segmentation dictionary to obtain processed corpus text;
The screening unit 43 is configured to screen out keywords based on the processed corpus text, where a keyword is a segment in the corpus text that serves as the correct answer and establishes a mapping relationship with the stem;
The first generation unit 44 is configured to generate questions from the processed corpus text based on the keywords.
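The way the four units hand data to one another can be sketched as a small pipeline. This is only an illustration: the class name and the callables standing in for each unit are hypothetical, and the toy units in the usage lines (whitespace splitting, digits as keywords) are not the patent's segmentation or screening logic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DataProcessingDevice:
    extract: Callable[[], Iterable[str]]      # extraction unit
    segment: Callable[[str], List[str]]       # word segmentation unit
    screen: Callable[[List[str]], List[str]]  # screening unit
    generate: Callable[[List[str], str], dict]  # first generation unit

    def run(self):
        """Run extraction, segmentation, screening and generation in order."""
        questions = []
        for text in self.extract():
            tokens = self.segment(text)
            for keyword in self.screen(tokens):
                questions.append(self.generate(tokens, keyword))
        return questions


# Toy stand-ins for the units, for illustration only.
device = DataProcessingDevice(
    extract=lambda: ["founded in 2006"],
    segment=lambda text: text.split(),
    screen=lambda tokens: [t for t in tokens if t.isdigit()],
    generate=lambda tokens, kw: {
        "stem": " ".join("( )" if t == kw else t for t in tokens),
        "answer": kw,
    },
)
questions = device.run()
```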
In one embodiment, the device 40 further includes a first creating unit, configured to:
Obtain original corpus text from specific websites;
Filter the original corpus text based on preset rules to obtain valid corpus text;
Establish the corpus using the obtained valid corpus text.
In one embodiment, the first creating unit is specifically configured to:
Screen the original corpus text according to a preset corpus integrity rule to obtain screened corpus text;
Perform character recognition processing on the screened corpus text and filter the screened corpus text to obtain the valid corpus text.
In one embodiment, the device 40 further includes a second creating unit, configured to:
Segment the corpus text in the corpus using a word segmentation algorithm to obtain a word segmentation result;
Establish the first word segmentation dictionary using the obtained word segmentation result in combination with a preset word segmentation dictionary.
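Combining the segmentation result with a preset dictionary can be sketched as below. The text does not state how the two are merged, so keeping multi-character segments that occur at least `min_count` times and taking the union with the preset dictionary is an assumption.

```python
from collections import Counter


def build_first_dictionary(segmented_texts, preset_dictionary, min_count=2):
    """Merge frequent corpus segments into the preset segmentation dictionary."""
    counts = Counter(token for tokens in segmented_texts for token in tokens)
    learned = {
        token
        for token, count in counts.items()
        if count >= min_count and len(token) > 1
    }
    return set(preset_dictionary) | learned


segmented = [
    ["端午节", "吃", "粽子"],
    ["端午节", "赛", "龙舟"],
    ["粽子", "龙舟"],
]
first_dictionary = build_first_dictionary(segmented, {"农历"})
```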
In one embodiment, the device 40 further includes a second generation unit, configured to:
Save a question to the question database when it is determined that the question is not yet stored in the question database.
In one embodiment, the second generation unit is specifically configured to:
Perform data cleaning on the generated questions and save the cleaned questions to the question database.
In practical application, the extraction unit 41, the word segmentation unit 42, the screening unit 43, the first generation unit 44, the first creating unit, the second creating unit and the second generation unit can all be implemented by a processor in the data processing device.
It should be noted that when the data processing device provided by the above embodiment performs data processing, the division into the above program modules is only an example. In practical applications, the above processing may be allocated to different program modules as needed; that is, the internal structure of the device may be divided into different program modules to complete all or part of the processing described above. In addition, the data processing device provided by the above embodiment belongs to the same concept as the data processing method embodiment; its specific implementation is detailed in the method embodiment and is not repeated here.
Based on the hardware implementation of the above program modules, and to implement the method of the embodiment of the present invention, the embodiment of the present invention provides a data processing device. As shown in Fig. 5, the device 50 includes a processor 51 and a memory 52 configured to store a computer program capable of running on the processor, wherein:
The processor 51 is configured to execute, when running the computer program, the method provided by one or more of the above technical solutions.
In practical application, as shown in Fig. 5, the components of the device 50 are coupled together through a bus system 53. It can be understood that the bus system 53 is used to implement connection and communication between these components. In addition to a data bus, the bus system 53 includes a power bus, a control bus and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 53 in Fig. 5.
In an exemplary embodiment, the embodiment of the present invention also provides a storage medium, which is a computer-readable storage medium, for example the memory 52 containing a computer program. The computer program can be executed by the processor 51 of the data processing device 50 to complete the steps of the foregoing method. The computer-readable storage medium may be a memory such as ferromagnetic random access memory (FRAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic surface storage, an optical disc, or compact disc read-only memory (CD-ROM).
It should be noted that "first", "second" and the like are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
In addition, the technical solutions described in the embodiments of the present invention may be combined arbitrarily, provided there is no conflict.
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A data processing method, characterized in that the method comprises:
Extracting corpus text from a corpus;
Performing word segmentation on the corpus text according to a first word segmentation dictionary to obtain processed corpus text;
Screening out a keyword based on the processed corpus text, wherein the keyword is a segment in the corpus text that serves as the correct answer and establishes a mapping relationship with the stem;
Generating a question from the processed corpus text based on the keyword.
2. The method according to claim 1, characterized in that, before the extracting of corpus text from the corpus, the method further comprises:
Obtaining original corpus text from specific websites;
Filtering the original corpus text based on preset rules to obtain valid corpus text;
Establishing the corpus using the obtained valid corpus text.
3. The method according to claim 2, characterized in that the filtering of the original corpus text based on preset rules to obtain valid corpus text comprises:
Screening the original corpus text according to a preset corpus integrity rule to obtain screened corpus text;
Performing character recognition processing on the screened corpus text to obtain the valid corpus text.
4. The method according to claim 1, characterized in that the method further comprises:
Segmenting the corpus text in the corpus using a word segmentation algorithm to obtain a word segmentation result;
Establishing the first word segmentation dictionary using the obtained word segmentation result in combination with a preset word segmentation dictionary.
5. The method according to claim 1, characterized in that the method further comprises:
Saving the question to a question database when it is determined that the question is not yet stored in the question database.
6. A data processing device, characterized by comprising:
An extraction unit, configured to extract corpus text from a corpus;
A word segmentation unit, configured to perform word segmentation on the corpus text according to a first word segmentation dictionary to obtain processed corpus text;
A screening unit, configured to screen out a keyword based on the processed corpus text, wherein the keyword is a segment in the corpus text that serves as the correct answer and establishes a mapping relationship with the stem;
A first generation unit, configured to generate a question from the processed corpus text based on the keyword.
7. The device according to claim 6, characterized in that the device further comprises a first creating unit, configured to:
Obtain original corpus text from specific websites;
Filter the original corpus text based on preset rules to obtain valid corpus text;
Establish the corpus using the obtained valid corpus text.
8. The device according to claim 6, characterized in that the device further comprises a second creating unit, configured to:
Segment the corpus text in the corpus using a word segmentation algorithm to obtain a word segmentation result;
Establish the first word segmentation dictionary using the obtained word segmentation result in combination with a preset word segmentation dictionary.
9. A data processing device, characterized by comprising a processor and a memory configured to store a computer program capable of running on the processor;
Wherein the processor is configured to perform the steps of the method according to any one of claims 1 to 5 when running the computer program.
10. A storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN201811634468.0A 2018-12-29 2018-12-29 Data processing method, device and storage medium Pending CN109753656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811634468.0A CN109753656A (en) 2018-12-29 2018-12-29 Data processing method, device and storage medium

Publications (1)

Publication Number Publication Date
CN109753656A true CN109753656A (en) 2019-05-14

Family

ID=66404351

Country Status (1)

Country Link
CN (1) CN109753656A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136302A (en) * 2011-12-05 2013-06-05 北大方正集团有限公司 Method and device of test question repeat output
JP2015215681A (en) * 2014-05-08 2015-12-03 日本放送協会 Keyword extraction device and program
CN106409041A (en) * 2016-11-22 2017-02-15 深圳市鹰硕技术有限公司 Generation method and system for gap filling test question and grading method and system for gap filling test paper

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175332A (en) * 2019-06-03 2019-08-27 山东浪潮人工智能研究院有限公司 A kind of intelligence based on artificial neural network is set a question method and system
CN112800200A (en) * 2021-01-26 2021-05-14 广州欢网科技有限责任公司 Program title compiling method, device and equipment
CN113177117A (en) * 2021-03-18 2021-07-27 深圳市北科瑞讯信息技术有限公司 News material acquisition method and device, storage medium and electronic device
CN113221558A (en) * 2021-05-28 2021-08-06 中邮信息科技(北京)有限公司 Express delivery address error correction method and device, storage medium and electronic equipment
CN113221558B (en) * 2021-05-28 2023-09-19 中邮信息科技(北京)有限公司 Express address error correction method and device, storage medium and electronic equipment
CN113505195A (en) * 2021-06-24 2021-10-15 作业帮教育科技(北京)有限公司 Knowledge base, construction method and retrieval method thereof, and question setting method and system based on knowledge base
CN113627137A (en) * 2021-10-11 2021-11-09 江西软云科技股份有限公司 Question generation method, question generation system, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination