CN110298033A - Keyword corpus annotation, training and extraction tool - Google Patents

Info

Publication number
CN110298033A
Authority
CN
China
Prior art keywords
keyword
corpus
model
algorithm
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910455064.3A
Other languages
Chinese (zh)
Other versions
CN110298033B (en)
Inventor
崔莹
代翔
黄细凤
王侃
杨拓
余博
朱宇涛
李超
李源源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN201910455064.3A
Publication of CN110298033A
Application granted
Publication of CN110298033B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword corpus annotation, training and extraction tool that aims to reduce the complexity of manual annotation and to improve the efficiency and accuracy of annotating massive keyword corpora. The technical scheme is as follows: a keyword corpus annotation preparation module distinguishes massive corpus data from different sources; a semi-automatic corpus keyword annotation module creates keyword annotation tasks, autonomously selects a suitable algorithm and performs automatic annotation based on algorithm models; by integrating at least one of the CHI, LDA, TEXTRANK and TFIDF keyword extraction algorithms, it pre-annotates the text corpus data to be annotated and fuses the annotation results of multiple algorithms; after an annotation task is completed, a feedback keyword annotation model learning and training module trains the keyword annotation algorithm models; and a keyword annotation model effect evaluation module automatically evaluates annotation quality against quantified model indicators.

Description

Keyword corpus annotation, training and extraction tool
Technical field
The present invention relates to the field of text mining technology, and in particular to a semi-automatic keyword corpus annotation, training and extraction tool.
Background art
In natural language processing, the key to handling massive numbers of text files is extracting the issues that users care about most. Whether a text is long or short, its central theme can often be glimpsed through a few keywords. At the same time, both text-based recommendation and text-based search depend heavily on keywords, and the accuracy of keyword extraction directly affects the final performance of a recommender system or search system. Keyword extraction is therefore an important part of text mining. The rapid development of the Internet has given people easy access to information, and the number of electronic documents such as web pages, e-mails and e-books keeps growing. But these explosively growing information resources lack structured content, so people who obtain large amounts of information must also spend a great deal of time reading and organizing it, which greatly reduces retrieval efficiency. Organizing these numerous, disordered resources, using information more efficiently, and letting people obtain the key information in texts easily, quickly and accurately has therefore become extremely important. Automatic keyword extraction is widely used in many fields. In knowledge mining, information retrieval, text clustering and text classification in particular, automatic keyword indexing is a basic and core technology, and it also plays a key role in relevance feedback, automatic filtering, and event detection and tracking. Automatic keyword indexing can be regarded as the groundwork for all automatic text processing and is indispensable in many text analysis tasks. Keyword extraction plays a very important role in automatic summarization, information retrieval, text classification and related areas. How to obtain information from the Internet intelligently, quickly and effectively has become an urgent problem in computing, and keyword extraction is an important means of obtaining Internet information quickly and accurately. Keywords have some typical characteristics: they are generally nouns or noun phrases; they generally do not begin or end with a stop word; and they are generally not very long. Feature selection must be considered in keyword extraction: developing and selecting features is both a focus and a difficulty, and the quality of the selected features directly affects how keywords are judged.
In recent years, with the rapid development of big data acquisition and the explosive growth of text information, it has become increasingly difficult to obtain the information one needs on the network. Extracting maximum value from data has become especially urgent, which places entirely new demands on intelligent big data analysis. To handle rapidly expanding information resources, manual processing is no longer practical, so automatic processing methods are needed to help people manage and organize information effectively and to solve the problem of being rich in information but poor in knowledge. Against this background, machine learning, deep learning and related technologies have developed rapidly and achieved great success in big data applications, and the underlying model algorithms rely increasingly on large annotated corpora for training support. Most current keyword extraction algorithms use statistical information about words to measure their importance and select the words that exceed a certain threshold as the keywords of an article. However, statistical methods are computationally expensive and need large statistical corpora. Several keyword measurement functions have been proposed on this basis, including TF/IDF, entropy functions and distribution coefficients. Many machine learning algorithms have also been applied to keyword extraction. CHI and TFIDF, for example, can each serve as a feature selection method or as a weight calculation method; the difference is that TFIDF can be applied to any text collection, whereas CHI can only be computed when the texts carry category labels. TextRank was proposed as a keyword extraction method and was later also tried as a weight calculation method, but its computational complexity is very high.
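As an illustration of the TF/IDF weighting mentioned above, the following is a minimal Python sketch. It assumes documents are already tokenized and uses the plain formula tf × log(N / df); the function name `tfidf_keywords` and the toy corpus are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of one tokenized document by TF-IDF (illustrative only)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["keyword", "extraction", "keyword", "corpus"],
    ["corpus", "annotation", "tool"],
    ["corpus", "training", "model"],
]
top = tfidf_keywords(docs, 0, top_k=2)
```

A word such as "corpus" that occurs in every document receives a weight of zero, which reflects the point that words frequent across all articles are less informative than words frequent in only one.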
Annotating massive corpus data has an important influence on the training of algorithm models. As basic work in big data analysis, it mainly supports day-to-day big data research and development, algorithm tuning, and demonstration and verification, and it is a key foundation of big data mining and analysis. Keywords are words that express the essential meaning of a text's main content; they are words or phrases selected from the title, abstract and body to serve text indexing or retrieval. Keyword extraction is the process of selecting, through statistical and semantic analysis of core words, a suitable set of feature items from a single text or a text collection that fully expresses the subject content. Because keywords are the most basic units expressing the subject of a text, keyword extraction is usually performed first in natural language processing and Chinese information processing tasks such as automatic summarization, information retrieval, text clustering, automatic question answering and topic tracking, and it also provides valuable clues for information monitoring and tracking. Part of speech is obtained through word segmentation and syntactic analysis. Among existing keywords, most are nouns or gerunds. In general, the position where a word occurs is also highly informative. For example, the title and abstract are the author's own summary of the article's central idea, so words appearing there are representative to some degree and more likely to be keywords. However, because each author has different habits and writing styles, the position of key sentences also varies, so position is only a very rough way of obtaining keywords and is not normally used alone. An obvious measure of whether a word is important in an article is its frequency: important words tend to occur repeatedly. On the other hand, a word that occurs more often is not necessarily more important, because some words occur frequently in all kinds of articles and are certainly less important than words that occur frequently only in one particular article. Term frequency denotes how often a word occurs in a text. It is generally assumed that the more frequently a word occurs in a text, the more likely it is to be a core word of the article. Term frequency simply counts the occurrences of a word in the text, but keywords obtained from term frequency alone are highly uncertain, and for long texts this method is very noisy. The core of statistical-feature-based keyword extraction is the calculation of quantitative feature indicators, and different indicators give different results. Different indicators also have their own strengths and weaknesses, so in practice several indicators are usually combined to obtain the top-k words as keywords. Keyword extraction is widely used in text mining, but existing methods have certain problems. Among prior-art methods based on statistics and machine learning, the performance of machine learning methods depends more heavily on manually annotated corpora: model parameters are trained on observed data (the annotated corpus), and at the segmentation stage the model computes the probability of each candidate segmentation and takes the most probable one as the final result. Machine learning methods can only be applied if a sufficiently large knowledge base or training set is built. Because the problem of knowledge learning has not yet been fundamentally solved, knowledge bases are updated very slowly and cannot keep pace with current scientific development. Annotated data sets provide limited information, and manually annotating samples is time-consuming and laborious, so large-scale annotation is too costly. Easily obtained unannotated samples (such as web pages on the Internet) are far more numerous than annotated ones and are also closer to the data distribution over the whole sample space. Providing as many annotated samples as possible requires arduous and slow manual labor, which hampers construction of the whole system and creates an annotation bottleneck. The idea of keyword extraction based on statistical features is to use the statistical information of the words in a document to extract its keywords. The text is usually preprocessed to obtain a set of candidate words, and keywords are then selected from the candidate set by quantifying feature values. Common sequence labeling models are HMM and CRF. Segmentation algorithms of this kind handle ambiguity and out-of-vocabulary words well and outperform the previous kind, but they need a large amount of manually annotated data and segment more slowly. Supervised keyword extraction treats keyword extraction as a binary classification problem: each word or phrase in a document is judged to be a keyword or not. Since it is a classification problem, an annotated training corpus must be provided; a keyword extraction model is trained on the training corpus and then used to extract keywords from the target documents.
Traditional keyword extraction methods fall into two kinds, unsupervised and supervised. Unsupervised methods include TF-IDF, Chi-squared, TextRank and LDA, while supervised methods convert keyword extraction into a binary classification problem that judges whether each word is a keyword; supervised methods such as NaiveBayes and the C4.5 decision tree have been used for keyword extraction. Each kind has its advantages and disadvantages. Unsupervised methods do not need a manually annotated training set and are therefore faster, but because they cannot combine multiple kinds of information to rank candidate words, they may perform worse than supervised methods. Supervised methods can learn through training how strongly each kind of information influences the keyword decision, so they perform better, but in today's data era annotating a training set is very time-consuming and laborious. The weakness of supervised text keyword extraction algorithms is their high labor cost. A third kind of method has the computer simulate human understanding of sentences in order to recognize words. Owing to the complexity of Chinese semantics, it is difficult to organize the various kinds of linguistic information into a machine-readable form, and because a large training corpus must be annotated manually, which is time-consuming and laborious, such word segmentation systems are still at the experimental stage. When a language network graph is built, the preprocessed words serve as nodes and the relations between words as edges. In a language network graph, the weight of an edge generally expresses the degree of association between the words. To obtain keywords from a language network graph, the importance of each node is assessed, the nodes are ranked by importance, and the words represented by the top K nodes are chosen as keywords. Chinese has no explicit word boundaries, which makes automatic indexing of keyword strings harder still, and a keyword extraction model needs more training data to reach stable performance. In practice, because application environments are complex, the same keyword extraction method does not perform equally well on different types of text, such as long and short texts. Different algorithms are used under different conditions, and no single algorithm performs well in all circumstances. Engineering results also depend heavily on text preprocessing and on the accuracy of word segmentation. Wrongly written characters, variant words and similar issues have to be resolved at the preprocessing stage, and the choice of segmentation algorithm and the recognition of out-of-vocabulary and ambiguous words greatly influence keyword extraction. Keyword extraction looks simple but is very difficult in practice. A tagged corpus of Chinese opinion-type subjective texts contains annotations for segmentation, part of speech, dependency, semantics, word concepts, opinions and much other information, so completing it is usually complicated. To lighten the annotators' burden, improve annotation efficiency and accuracy, and reduce the annotation error rate, an automatic annotation tool for keyword corpora is needed to assist annotators. Keyword corpora are currently scarce in the field, and keyword corpus annotation is mainly done manually, with widespread problems of poor annotation quality, cumbersome annotation processes, low efficiency and high labor cost. Existing keyword corpus annotation tools also offer only a single annotation method and can hardly update their annotation models automatically. A semi-automatic keyword annotation and training platform that can assist manual corpus annotation is therefore urgently needed to solve these problems.
Summary of the invention
The purpose of the invention is to address the drawbacks of the keyword corpus annotation and training process described above by providing a semi-automatic keyword corpus annotation and training tool that reduces the complexity of manual annotation, lowers labor cost, and improves the efficiency and accuracy of annotating massive keyword corpora.
The above purpose of the invention is achieved by the following technical scheme: a keyword corpus annotation, training and extraction tool comprising a keyword corpus annotation preparation module, a semi-automatic corpus keyword annotation module, a feedback keyword annotation model learning and training module, and a keyword annotation model effect evaluation module, characterized in that: the keyword corpus annotation preparation module distinguishes massive corpus data from different sources and, for keyword corpora with different purposes, selects the corpus source and designates it as the corpus to be annotated for that purpose, that is, the raw corpus; the semi-automatic corpus keyword annotation module first creates a keyword annotation task and, according to the annotation requirements and corpus characteristics, autonomously selects a suitable algorithm and performs automatic annotation based on algorithm models, pre-annotating the text corpus data to be annotated with a single keyword method by integrating at least one of the keyword extraction algorithms CHI, LDA, the graph-ranking-based TEXTRANK, and TFIDF; it can also pre-annotate the data with a single keyword method through automatic annotation based on business rules, or select several keyword extraction algorithms at once and fuse their annotation results; the fused results are then reviewed and adjudicated manually according to the keyword annotation business standard and saved as annotated corpus, which is managed by the keyword corpus annotation preparation module for use in training annotation algorithm models, with a unified keyword model access standard provided to complete the corpus keyword annotation work; after an annotation task is completed, the feedback keyword annotation model learning and training module provides learning and training, through keyword algorithm model parameter settings, for the internally integrated keyword annotation algorithm models and for external deep-enhanced annotation algorithm models, retrains the keyword annotation algorithm models on the annotated keyword corpus, feeds back model improvements and updates, and, through continuous iteration between model updating and corpus annotation with automatic feedback adjustment, completes new keyword annotation tasks; the keyword annotation model effect evaluation module builds keyword evaluation measures according to the keyword evaluation criteria, quantifies those measures on the basis of keyword indicator rules, establishes a comprehensive evaluation model for the annotation algorithms, automatically evaluates the annotation quality of each model against the quantified indicators, and automatically recommends the best annotation model for subsequent keyword annotation tasks.
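The evaluation module's job of scoring each annotation model and recommending the best one can be illustrated as follows. The patent does not name its evaluation indicators, so this sketch assumes set-based precision, recall and F1 against manually confirmed keywords; `keyword_f1`, `recommend_model` and the sample data are hypothetical.

```python
def keyword_f1(predicted, gold):
    """F1 between predicted and manually confirmed keywords (assumed metric)."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def recommend_model(predictions_by_model, gold_by_doc):
    """Return the model with the highest mean F1, plus all mean scores."""
    mean_f1 = {}
    for model, predictions in predictions_by_model.items():
        scores = [keyword_f1(predictions[doc], gold)
                  for doc, gold in gold_by_doc.items()]
        mean_f1[model] = sum(scores) / len(scores)
    best = max(mean_f1, key=mean_f1.get)
    return best, mean_f1


# Hypothetical adjudicated corpus and per-model pre-annotations.
gold_by_doc = {"d1": ["corpus", "keyword"], "d2": ["model", "training"]}
predictions_by_model = {
    "TFIDF": {"d1": ["corpus", "keyword"], "d2": ["model", "tool"]},
    "TEXTRANK": {"d1": ["corpus", "tool"], "d2": ["tool", "data"]},
}
best, mean_f1 = recommend_model(predictions_by_model, gold_by_doc)
```

The recommended model could then be offered as the default for the next annotation task, which is the feedback step the text describes; any other quantified indicator could replace F1 here.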
Compared with the prior art, the present invention has the following beneficial effects:
It reduces the complexity of manual annotation and lowers labor cost. The system consists of four modules: keyword corpus annotation preparation, semi-automatic corpus keyword annotation, feedback keyword annotation model learning and training, and keyword annotation model effect evaluation. For different annotation requirements and corpus characteristics, it provides automatic annotation based on autonomously selected algorithms and on multi-algorithm fusion. Multi-algorithm fusion combines the results of several algorithms by voting; ignoring correlation between the algorithms, the integrated method outperforms any single method. Pre-annotation performed this way reduces the complexity of manual annotation and lowers labor cost, while offering flexibility and a high degree of automation.
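The voting-based fusion of several algorithms' results can be sketched as below. The patent only states that a voting method is used; the majority threshold `min_votes`, the function name `fuse_by_voting` and the sample results are assumptions made for illustration.

```python
from collections import Counter


def fuse_by_voting(results_by_algorithm, min_votes=2):
    """Keep keywords proposed by at least `min_votes` algorithms."""
    votes = Counter()
    for keywords in results_by_algorithm.values():
        # Each algorithm votes at most once per keyword.
        votes.update(set(keywords))
    fused = [word for word, count in votes.items() if count >= min_votes]
    # Most-voted keywords first; ties are broken alphabetically.
    return sorted(fused, key=lambda w: (-votes[w], w))


# Hypothetical pre-annotation results for one document.
results = {
    "TFIDF": ["corpus", "keyword", "model"],
    "TEXTRANK": ["keyword", "corpus", "tool"],
    "LDA": ["keyword", "training", "model"],
}
fused = fuse_by_voting(results)
```

Keywords proposed by only one algorithm are dropped, and the fused list is what a human annotator would then review and confirm.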
Keyword corpus annotation is efficient. The invention manages keyword corpora by distinguishing data from different sources. The back end integrates keyword extraction algorithms such as CHI, LDA, TEXTRANK and TFIDF in real time, so that for different keyword corpora a library of trained keyword extraction models (CHI, LDA, TEXTRANK, TFIDF and the like) can be selected during annotation. The corpus data to be annotated is pre-annotated either with a single keyword method or with a fusion of several keyword methods, and a manual adjudication step is introduced. The system supports real-time automatic feedback adjustment of the back-end keyword algorithm models to complete new keyword annotation tasks, which substantially shortens the time needed to obtain information, improves the efficiency of information acquisition, and greatly improves corpus annotation efficiency.
According to the annotation requirements and corpus characteristics, the invention autonomously selects a suitable algorithm and performs automatic annotation. By integrating at least one of the CHI, LDA, TEXTRANK and TFIDF keyword extraction algorithms, it pre-annotates the text corpus data to be annotated with a single keyword method or with a fusion of several keyword methods, and it provides a unified keyword model access standard to complete the corpus keyword annotation work. After an annotation task is completed, the keyword models are retrained on the annotated corpus. Annotation quality is evaluated by a comprehensive evaluation model for the annotation algorithms, and the evaluation is fed back into keyword model learning and training so that the models reach their best performance and the accuracy of the keyword annotation models increases. For subsequent new annotation tasks, continuous iteration between model updating and corpus annotation improves both the quality of corpus keyword annotation and the performance of the algorithm models, and reduces the keyword annotation error rate. Finally, a manual adjudication step reviews the annotation results: in a manual confirmation step the annotated keyword corpus is modified, confirmed and submitted, completing the corpus keyword annotation work and greatly improving the accuracy and precision of keyword extraction. Experiments confirm that the tool is effective for annotating keyword corpora.
The invention simplifies the user's annotation workflow, provides a friendly interactive annotation interface, and supports importing, training and using external models.
Description of the drawings
Fig. 1 is a schematic diagram of the keyword corpus annotation, training and extraction tool of the present invention.
Fig. 2 is a flow chart of the keyword model training process of Fig. 1.
To make the purpose, technical scheme and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the drawings.
Specific embodiment
Refer to Fig. 1. In the preferred embodiment described below, a keyword corpus annotation, training and extraction tool comprises a keyword corpus annotation preparation module, a semi-automatic corpus keyword annotation module, a feedback keyword annotation model learning and training module, and a keyword annotation model effect evaluation module, wherein: the keyword corpus annotation preparation module distinguishes massive corpus data from different sources and, for keyword corpora with different purposes, selects the corpus source and designates it as the corpus to be annotated for that purpose, that is, the raw corpus; the semi-automatic corpus keyword annotation module first creates a keyword annotation task and then, according to the annotation requirements and corpus characteristics, autonomously selects a suitable algorithm and performs automatic annotation based on algorithm models, pre-annotating the text corpus data to be annotated with a single keyword method by integrating at least one of the keyword extraction algorithms CHI, LDA, the graph-ranking-based TEXTRANK, and TFIDF; it can also pre-annotate the data with a single keyword method through automatic annotation based on business rules, or select several keyword extraction algorithms at once and fuse their annotation results; the fused results are reviewed and adjudicated manually according to the keyword annotation business standard and saved as annotated corpus, which is managed by the keyword corpus annotation preparation module for use in training annotation algorithm models, with a unified keyword model access standard provided to complete the corpus keyword annotation work; after an annotation task is completed, the feedback keyword annotation model learning and training module provides learning and training, through keyword algorithm model parameter settings, for the internally integrated keyword annotation algorithm models and for external deep-enhanced annotation algorithm models, retrains the keyword annotation algorithm models on the annotated keyword corpus, feeds back model improvements and updates, and, through continuous iteration between model updating and corpus annotation with automatic feedback adjustment, completes new keyword annotation tasks; the keyword annotation model effect evaluation module builds keyword evaluation measures according to the keyword evaluation criteria, quantifies those measures on the basis of keyword indicator rules, establishes a comprehensive evaluation model for the annotation algorithms, automatically evaluates the annotation quality of each model against the quantified indicators, and automatically recommends the best annotation model for subsequent keyword annotation tasks.
In this embodiment, the text corpus annotation preparation module manages the corpus to be annotated by source or by topic, preparing for annotation tasks. The semi-automatic corpus keyword annotation module autonomously selects a suitable algorithm for the annotation requirements and corpus characteristics, performs automatic annotation, and has the results reviewed in a manual adjudication step. The specific steps are as follows:
The semi-automatic corpus keyword labeling module creates keyword labeling tasks for corpora from different sources and selects, for each labeling task, the algorithm model best suited to it. For a keyword labeling task, automatic labeling can be performed with a keyword extraction algorithm such as CHI, LDA, TEXTRANK, or TFIDF. The specific labeling algorithm can be configured according to its automatic labeling performance on the corpus, and the module can automatically recommend a default labeling algorithm based on the results of the keyword labeling model effect evaluation module.
The module first creates a keyword labeling task. Then, according to the labeling requirements and the characteristics of the corpus, it selects a suitable algorithm and performs algorithm-model-based automatic labeling: using at least one of the integrated keyword extraction algorithms (CHI, LDA, the graph-ranking-based TEXTRANK, and TFIDF), it pre-labels single keywords in the text corpus data to be labeled. The same data can also be pre-labeled with single keywords by automatic labeling based on business rules.
For special labeling tasks, the module creates business labeling rules and manages them. These rules mainly comprise business dictionaries and regular expressions for string matching, for example rules for key dates and key geographic locations. A regular expression can be defined directly as a variable such as reg (dim reg as RegExp), and, once the Microsoft Scripting Runtime reference is selected, a dictionary object can be defined directly as a variable (dim d as dictionary). The regular expression matching process is essentially as follows: characters are taken in turn from the expression and compared with the text; if every character matches, the match succeeds, and as soon as one character fails to match, the match fails. Annotators use the business labeling rules to label the corpus automatically.
The algorithm-model-based automatic labeling results and the business-rule-based automatic labeling results are then fused. Several keyword extraction algorithms can also be selected at once for keyword labeling, and their labeling results fused. The fused labeling results are reviewed manually against the keyword labeling business standard, then modified, confirmed, and saved. The saved results are stored as labeled corpus and managed by the keyword corpus labeling preparation module for use in training labeling algorithm models. A unified keyword model access standard is provided, and the corpus keyword labeling work is completed.
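The rule-based pre-labeling and the fusion step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary terms, the date pattern, the function names `label_by_rules` and `fuse`, and the vote-counting fusion strategy are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical business rules: a small business dictionary and one date pattern.
BUSINESS_DICTIONARY = {"radar", "satellite"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def label_by_rules(text):
    """Return keywords found by dictionary lookup or by regular expression match."""
    lowered = text.lower()
    hits = {term for term in BUSINESS_DICTIONARY if term in lowered}
    hits.update(DATE_PATTERN.findall(text))
    return hits


def fuse(label_sets, min_votes=1):
    """Keep keywords proposed by at least `min_votes` labelers, most-voted first.

    Vote counting is an assumed fusion strategy; the source does not specify one.
    """
    votes = Counter(kw for labels in label_sets for kw in set(labels))
    kept = [kw for kw, count in votes.items() if count >= min_votes]
    return sorted(kept, key=lambda kw: (-votes[kw], kw))


text = "The radar test on 2019-05-29 tracked a satellite."
rule_labels = label_by_rules(text)
model_labels = {"radar", "test"}  # stand-in for the output of an extraction algorithm
fused = fuse([rule_labels, model_labels])
agreed = fuse([rule_labels, model_labels], min_votes=2)
```

The fused list is what an annotator would then review; raising `min_votes` trades coverage for agreement between labelers.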
Refer to Fig. 2. The feedback-type keyword labeling model learning and training module provides model learning and training, through keyword algorithm model parameter settings, for the internally integrated keyword labeling algorithm models and for external deep-enhanced labeling algorithm models.
In the keyword model training process, the module reads the labeled corpus to be used for training and selects a keyword algorithm to train. If the algorithm is not trainable, no training takes place and the process ends. For trainable algorithms such as CHI, LDA, TEXTRANK, and TFIDF, the module performs offline training on the labeled corpus data by calling the unified training model interface Train and generates a Kryo-serialized keyword model file, so that the model reaches its best accuracy. After the file is generated, the module decides whether to save the keyword model. If not, the process ends. If so, it imports external algorithm models through the unified model access interface, updates or exports the external algorithm models, saves the keyword model file (comprising the algorithm name, the model name, and the serialized model file), and updates the keyword training model table. The trained model then replaces the model used for keyword labeling on the platform, and new keyword labeling tasks are completed with it.
In the keyword model update process, the module starts the keyword service and selects the keyword algorithm to be updated. If the selected algorithm is not trainable, the process ends. For a selected trainable algorithm such as CHI, LDA, TEXTRANK, or TFIDF, the module parses the update-keyword switch in the configuration file to decide whether to update the keyword model. If not, the process ends. If so, it reads the designated keyword model file according to the keyword model name and the keyword training model table, deserializes the file to load the keyword model, and ends the program.
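The training and update flows described above can be sketched in a few lines. This is a hedged illustration under stated assumptions: Python's `pickle` stands in for the Kryo serialization named in the text (Kryo is a Java library), and the `train` function, the `update_keyword_model` configuration key, the file naming, and the dictionary used as the model table are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

TRAINABLE = {"CHI", "LDA", "TEXTRANK", "TFIDF"}  # trainable algorithms named in the text


def train(algorithm, labeled_corpus):
    """Hypothetical stand-in for the unified Train interface: counts keyword labels."""
    counts = {}
    for _text, keywords in labeled_corpus:
        for kw in keywords:
            counts[kw] = counts.get(kw, 0) + 1
    return {"algorithm": algorithm, "keyword_counts": counts}


def train_and_save(algorithm, labeled_corpus, model_name, model_dir, model_table):
    """Train offline, serialize the model, and record it in the model table."""
    if algorithm not in TRAINABLE:
        return None  # not trainable: no training, the process ends
    model = train(algorithm, labeled_corpus)
    path = Path(model_dir) / f"{model_name}.bin"
    path.write_bytes(pickle.dumps(model))  # pickle used here in place of Kryo
    model_table[model_name] = {"algorithm": algorithm, "file": str(path)}
    return path


def load_for_update(algorithm, model_name, model_table, config):
    """Deserialize the designated model if the algorithm is trainable and the switch is on."""
    if algorithm not in TRAINABLE or not config.get("update_keyword_model", False):
        return None
    return pickle.loads(Path(model_table[model_name]["file"]).read_bytes())


corpus = [("doc one", ["radar", "signal"]), ("doc two", ["radar"])]
table = {}
with tempfile.TemporaryDirectory() as tmp:
    train_and_save("TFIDF", corpus, "tfidf_v1", tmp, table)
    loaded = load_for_update("TFIDF", "tfidf_v1", table, {"update_keyword_model": True})
    skipped = load_for_update("TFIDF", "tfidf_v1", table, {"update_keyword_model": False})
    untrained = train_and_save("RULES", corpus, "rules_v1", tmp, {})
```

Both early exits mirror the text: an untrainable algorithm ends the process before training, and a switched-off update flag ends it before any model file is read.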
The labeling model effect evaluation module provides methods for building model evaluation indicators, building rules, and quantifying indicators, and supports automatic assessment of a model's labeling performance by constructing a comprehensive labeling algorithm evaluation model. The specific steps are as follows: the module sets up single-indicator algorithms according to the evaluation criteria; it quantifies each indicator according to the indicator calculation rules; it organizes the indicators relevant to each labeling task into a comprehensive labeling algorithm evaluation model; and it computes the comprehensive indicator value and feeds back the labeling model's performance.
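One way to picture the comprehensive evaluation model is as a weighted combination of quantified indicators, with the best-scoring algorithm recommended for the next task. The sketch below assumes a weighted average; the source does not state how indicators are combined, and the indicator names, weights, and scores are illustrative values, not measured results.

```python
def composite_score(indicators, weights):
    """Weighted average of the quantified indicators that the task assigns weights to."""
    total_weight = sum(weights.values())
    return sum(indicators[name] * w for name, w in weights.items()) / total_weight


def recommend(algorithm_indicators, weights):
    """Return the name of the algorithm with the highest composite score."""
    return max(
        algorithm_indicators,
        key=lambda name: composite_score(algorithm_indicators[name], weights),
    )


# Illustrative numbers only: per-task weights and per-algorithm indicator values.
task_weights = {"precision": 0.5, "recall": 0.3, "mrr": 0.2}
results = {
    "TFIDF": {"precision": 0.60, "recall": 0.50, "mrr": 0.70},
    "TEXTRANK": {"precision": 0.55, "recall": 0.70, "mrr": 0.60},
}
best = recommend(results, task_weights)
```

Different labeling tasks would supply different `task_weights`, which is how the same set of single indicators can yield different recommendations per task.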
There is as yet no unified method, at home or abroad, for evaluating the quality of keyword extraction, because the selection of text data is highly subjective. The present invention therefore evaluates keyword extraction quality in two ways: quantitative machine analysis and subjective human judgment. The most commonly used indicators for quantitative machine analysis are precision P (Precision), recall R (Recall), the F value, which is the harmonic mean of keyword extraction precision and recall, and the E value, which weights keyword extraction precision and recall according to application requirements.
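The formulas for P, R, and F are not reproduced in this text. Under the standard definitions of these measures, which are assumed here rather than taken from the source, and writing $N_{\mathrm{correct}}$ for the number of correctly extracted keywords, $N_{\mathrm{extracted}}$ for the number of extracted keywords, and $N_{\mathrm{standard}}$ for the number of standard-answer keywords (all three symbols are introduced for this sketch), they would read:

```latex
P = \frac{N_{\mathrm{correct}}}{N_{\mathrm{extracted}}}, \qquad
R = \frac{N_{\mathrm{correct}}}{N_{\mathrm{standard}}}, \qquad
F = \frac{2PR}{P + R}
```

For example, if 4 keywords are extracted, 2 of them are correct, and the standard answer contains 3 keywords, then $P = 1/2$, $R = 2/3$, and $F = 4/7$.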
Precision and recall are generally inversely related: raising precision by some method lowers recall, and vice versa. To express an application system's differing requirements for precision and recall, a weight can be applied to them, giving the weighted value E of keyword extraction precision and recall. Here b is the added weight: the larger b is, the greater the weight of precision in the E value; conversely, the greater the weight of recall.
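The formula for E is not reproduced in this text, so the sketch below assumes one possible form: a weighted harmonic mean that puts weight b squared on precision and weight 1 on recall. This form was chosen only because it matches the stated behavior (a larger b moves E toward precision); it should not be read as the patent's formula, and the function name `e_value` is hypothetical.

```python
def e_value(p, r, b):
    """ASSUMED form of E: weighted harmonic mean, weight b**2 on precision, 1 on recall.

    With b = 1 this reduces to the ordinary harmonic mean of p and r.
    """
    if p == 0 or r == 0:
        return 0.0
    return (1 + b * b) / (b * b / p + 1 / r)


p, r = 0.5, 2 / 3
balanced = e_value(p, r, 1)          # equals the harmonic mean of p and r
precision_heavy = e_value(p, r, 2)   # pulled toward p
recall_heavy = e_value(p, r, 0.5)    # pulled toward r
```

Under this assumed form, E approaches P as b grows and approaches R as b shrinks toward zero, which is the relationship the paragraph describes.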
In addition, two other indicators are in common use: the binary preference measure (Bpref) and the mean reciprocal rank (MRR), an indicator for evaluating retrieval algorithms. Bpref is an evaluation metric that takes ranking order into account. For a document, suppose that R of the M extracted keywords are standard answers, with r denoting a correctly extracted keyword and n an incorrectly extracted keyword; Bpref is then calculated by the following formula:
$$\mathrm{Bpref} = \frac{1}{R}\sum_{r \in R}\left(1 - \frac{|n \text{ ranked higher than } r|}{M}\right)$$
The MRR indicator measures the rank of the first correctly recommended keyword for each document and is an evaluation indicator defined over a document set. For a document d, let $rank_d$ denote the rank of the first correctly recommended keyword; MRR is then defined as:
$$\mathrm{MRR} = \frac{1}{|D|}\sum_{d \in D}\frac{1}{rank_d}$$
where D is the set of documents used for the keyword extraction test.
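Both ranking-aware indicators can be computed directly from a ranked keyword list. The sketch below assumes that Bpref averages, over the correctly extracted keywords, one minus the share of the M extracted keywords that are incorrect and ranked higher, and that MRR averages the reciprocal rank of the first correct keyword over all documents. Treating a document with no correct keyword as contributing 0 is an additional assumption, and the function names are hypothetical.

```python
def bpref(ranked, standard):
    """Bpref for one document; `ranked` lists the extracted keywords in rank order."""
    standard = set(standard)
    m = len(ranked)
    correct_count = 0
    wrong_so_far = 0
    total = 0.0
    for kw in ranked:
        if kw in standard:
            correct_count += 1
            total += 1 - wrong_so_far / m  # penalty for incorrect keywords ranked higher
        else:
            wrong_so_far += 1
    return total / correct_count if correct_count else 0.0


def mrr(ranked_lists, standards):
    """Mean reciprocal rank of the first correct keyword over a document set."""
    total = 0.0
    for ranked, standard in zip(ranked_lists, standards):
        standard = set(standard)
        for position, kw in enumerate(ranked, start=1):
            if kw in standard:
                total += 1 / position
                break  # only the first correct keyword counts
    return total / len(ranked_lists)


doc_a = ["noise", "radar", "test", "signal"]
gold_a = {"radar", "signal", "antenna"}
doc_b = ["radar", "clutter"]
gold_b = {"radar"}

bpref_a = bpref(doc_a, gold_a)
mrr_ab = mrr([doc_a, doc_b], [gold_a, gold_b])
```

In `doc_a`, "radar" has one incorrect keyword above it and "signal" has two, so the two terms are 0.75 and 0.5; the first correct keyword sits at rank 2 in `doc_a` and rank 1 in `doc_b`.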
Corpora to be labeled are managed by source or topic in preparation for labeling tasks. Semi-automatic labeling of the keyword corpus is performed with the integrated keyword extraction algorithms such as CHI, LDA, TEXTRANK, and TFIDF; during labeling, an applicable labeling algorithm can be selected to pre-label keywords in the corpus data to be labeled. Finally, in a manual confirmation step, the labeled corpus is modified, confirmed, and submitted, which completes the corpus labeling work. After a labeling task is completed, the labeled corpus is used to retrain the model. A comprehensive labeling algorithm evaluation model is established to assess the model's labeling performance, and the assessment is fed back into model learning and training so that the model reaches its best performance for subsequent new labeling tasks. Continuous iteration between model updating and corpus labeling improves both corpus labeling quality and algorithm model performance.
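The iteration between labeling and model updating can be illustrated with a toy loop. Everything in this sketch is an assumption made for illustration: the "model" is just a set of known keywords, the `GOLD` table stands in for the annotator's confirmation step, and the hit rate of the pre-labels stands in for the evaluation module.

```python
# Stand-in for the annotator: the confirmed keywords for each text.
GOLD = {
    "radar tracks target": {"radar", "target"},
    "radar detects signal": {"radar", "signal"},
    "signal from target": {"signal", "target"},
}


def pre_label(model, text):
    """Propose keywords: words of the text that the toy model already knows."""
    return {word for word in text.split() if word in model}


def confirm(text, proposed):
    """Manual confirmation step: the annotator corrects the proposal."""
    return GOLD[text]


def retrain(model, labeled):
    """Toy retraining: the model absorbs every confirmed keyword."""
    return model | {kw for _text, kws in labeled for kw in kws}


def run(texts, model):
    """Label, confirm, retrain, and record how good each pre-labeling was."""
    labeled, hit_rates = [], []
    for text in texts:
        proposed = pre_label(model, text)
        confirmed = confirm(text, proposed)
        hit_rates.append(len(proposed & confirmed) / len(confirmed))
        labeled.append((text, confirmed))
        model = retrain(model, labeled)
    return model, hit_rates


final_model, hit_rates = run(list(GOLD), set())
```

In this toy run the pre-labels improve with each round as confirmed keywords flow back into the model, which is the feedback effect the paragraph describes.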
The above are preferred embodiments of the present invention. It should be noted that the above embodiments illustrate the present invention without limiting it, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. Those of ordinary skill in the art may make various changes and modifications without departing from the spirit and substance of the present invention, and such changes and modifications are also regarded as falling within the protection scope of the present invention.

Claims (10)

1. A keyword corpus labeling training extraction tool, comprising: a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, characterized in that: the keyword corpus labeling preparation module distinguishes massive corpus data by source, selects keyword corpus sources for keyword corpora of different purposes, and sets them as the corpora to be labeled for those purposes, i.e., raw corpora; the semi-automatic corpus keyword labeling module first creates a keyword labeling task and, according to the labeling requirements and corpus characteristics, selects a suitable algorithm and performs algorithm-model-based automatic labeling, either pre-labeling single keywords in the text corpus data to be labeled with at least one of the integrated keyword extraction algorithms CHI, LDA, the graph-ranking-based TEXTRANK, and TFIDF, or selecting several of these keyword extraction algorithms at once for keyword labeling and fusing their labeling results; after a labeling task is completed, the feedback-type keyword labeling model learning and training module provides algorithm model learning and training, through keyword algorithm model parameter settings, for the internally integrated keyword labeling algorithm models and external deep-enhanced labeling algorithm models, retrains the keyword labeling algorithm model with the labeled keyword corpus, feeds back model improvements and updates, and, through continuous iteration between model updating and corpus labeling with automatic feedback adjustment, completes new keyword labeling tasks; the keyword labeling model effect evaluation module builds keyword evaluation indicators according to the keyword evaluation indicator standard, quantifies the evaluation indicators based on the keyword indicator rules, establishes a comprehensive labeling algorithm evaluation model, automatically assesses labeling performance by model indicator quantification, and automatically recommends the best labeling model for subsequent keyword labeling tasks.
2. The keyword corpus labeling training extraction tool as claimed in claim 1, characterized in that: the fused labeling results are reviewed manually against the keyword labeling business standard, the labeling results are saved as labeled corpus and managed by the keyword corpus labeling preparation module for use in training labeling algorithm models, and a unified keyword model access standard is provided to complete the corpus keyword labeling work.
3. The keyword corpus labeling training extraction tool as claimed in claim 1, characterized in that: the semi-automatic corpus keyword labeling module creates business labeling rules for special labeling tasks and manages the business labeling rules, which mainly comprise business dictionaries and regular expressions for string matching.
4. The keyword corpus labeling training extraction tool as claimed in claim 1, characterized in that: the feedback-type keyword labeling model learning and training module reads the labeled corpus to be used for training and selects a keyword algorithm to train; for an algorithm that is not trainable, no training takes place and the process ends; for the trainable algorithms CHI, LDA, TEXTRANK, and TFIDF, the module performs offline training on the labeled corpus data, calls the unified training model interface Train, and generates a Kryo-serialized keyword model file, so that the model reaches its best accuracy.
5. The keyword corpus labeling training extraction tool as claimed in claim 3, characterized in that: after the Kryo-serialized keyword model file is generated, the feedback-type keyword labeling model learning and training module decides whether to save the keyword model; if not, the process ends; if so, it imports external algorithm models through the unified model access interface, updates or exports the external algorithm models, saves the keyword model file comprising the algorithm name, the model name, and the serialized model file, and updates the keyword training model table; the trained model is used to update the model used for keyword labeling on the platform, and new keyword labeling tasks are completed.
6. The keyword corpus labeling training extraction tool as claimed in claim 4, characterized in that: in the keyword model update, the feedback-type model learning and training module starts the keyword service and selects the keyword algorithm to be updated; if the selected keyword algorithm is not trainable, the process ends; for a selected trainable algorithm among CHI, LDA, TEXTRANK, and TFIDF, the module parses the update-keyword switch in the configuration file to decide whether to update the keyword model; if not, the process ends; if so, it reads the designated keyword model file according to the keyword model name and the keyword training model table, deserializes the keyword model file to load the keyword model, and ends the program.
7. The keyword corpus labeling training extraction tool as claimed in claim 1, characterized in that: the labeling model effect evaluation module sets up single-indicator algorithms according to the evaluation criteria; quantifies the indicators according to the indicator calculation rules; organizes the indicators relevant to each labeling task into a comprehensive labeling algorithm evaluation model; and computes the comprehensive indicator value and feeds back the labeling model's performance.
8. The keyword corpus labeling training extraction tool as claimed in claim 1, characterized in that: keyword extraction quality is evaluated in two ways, by quantitative machine analysis and by subjective human judgment.
9. The keyword corpus labeling training extraction tool as claimed in claim 7, characterized in that: the indicators of quantitative machine analysis are precision P (Precision), recall R (Recall), the F value, which is the harmonic mean of keyword extraction precision and recall, and the E value.
10. The keyword corpus labeling training extraction tool as claimed in claim 1, characterized in that: precision and recall are generally inversely related, so that raising precision by some method lowers recall; to express an application system's differing requirements for precision P and recall R, a weight is applied to precision P and recall R, giving the weighted value E, where b is the added weight: the larger b is, the greater the weight of precision in the E value; conversely, the greater the weight of recall.
CN201910455064.3A 2019-05-29 2019-05-29 Keyword corpus labeling training extraction system Active CN110298033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910455064.3A CN110298033B (en) 2019-05-29 2019-05-29 Keyword corpus labeling training extraction system

Publications (2)

Publication Number Publication Date
CN110298033A true CN110298033A (en) 2019-10-01
CN110298033B CN110298033B (en) 2022-07-08

Family

ID=68027297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910455064.3A Active CN110298033B (en) 2019-05-29 2019-05-29 Keyword corpus labeling training extraction system

Country Status (1)

Country Link
CN (1) CN110298033B (en)

Also Published As

Publication number Publication date
CN110298033B (en) 2022-07-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant