CN110298033A - Keyword corpus labeling, training, and extraction tool - Google Patents
- Publication number
- CN110298033A (application CN201910455064.3A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- corpus
- model
- algorithm
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a keyword corpus labeling, training, and extraction tool that aims to reduce the complexity of manual annotation and to improve the efficiency and accuracy of labeling massive keyword corpora. The technical scheme is as follows: a keyword corpus labeling preparation module distinguishes massive corpus data from different sources; a semi-automatic corpus keyword labeling module creates a keyword labeling task, independently selects a suitable algorithm, and performs automatic labeling based on the algorithm model, integrating at least one of the CHI, LDA, TextRank, and TF-IDF keyword extraction algorithms to pre-label the text corpus data to be labeled and fusing the labeling results of multiple algorithms; after the labeling task is completed, a feedback-type keyword labeling model learning and training module trains the keyword labeling algorithm model; and a keyword labeling model effect evaluation module automatically evaluates the labeling effect of the model against quantified indicators.
Description
Technical field
The present invention relates to the field of text mining technology, and in particular to a semi-automatic keyword corpus labeling, training, and extraction tool.
Background art
In natural language processing, the key to handling massive numbers of text files is extracting the issues users care about most. Whether a text is long or short, a few keywords are often enough to reveal its main idea. Both text-based recommendation and text-based search also depend heavily on keywords, and the accuracy of keyword extraction directly determines the final performance of a recommender or search system. Keyword extraction is therefore an important part of text mining. The rapid growth of the Internet has given people convenient ways to obtain information, and the number of electronic documents such as web pages, e-mails, and e-books keeps increasing. These rapidly growing information resources, however, lack structured content, so although people obtain large amounts of information, they must spend a great deal of time reading and organizing it, which greatly reduces retrieval efficiency. It has therefore become extremely important to organize these numerous, unordered resources, to use information more efficiently, and to let people obtain the key information in these texts easily, quickly, and accurately. Automatic keyword extraction is widely used in many fields. In knowledge mining, information retrieval, text clustering, and text classification in particular, automatic keyword indexing is a basic and core technology, and it also plays an important role in relevance feedback, automatic filtering, and event detection and tracking. Automatic keyword indexing can be regarded as the groundwork for all automatic text analysis and is indispensable in many text analysis tasks. Keyword extraction plays a very important role in automatic summarization, information retrieval, text classification, and similar areas. Obtaining information from the Internet intelligently, quickly, and effectively has become an urgent problem in computing, and keyword extraction is an important means of obtaining Internet information quickly and accurately. Keywords have several characteristics: they are generally nouns or noun phrases; they generally do not begin or end with a stop word; and they are generally not too long. Feature selection is a problem that keyword extraction must address. Developing and selecting features is both a focus and a difficulty, and the quality of the selected features directly affects how keywords are judged.
In recent years, with the rapid development of big-data collection and the explosive growth of text information, it has become increasingly difficult to obtain the needed information from the network. Extracting maximum value from data has become especially urgent, which places new demands on intelligent big-data analysis. Manual processing is impractical for handling rapidly expanding information resources, so automatic methods are needed to help people manage and organize information effectively and to solve the problem of being rich in information but poor in knowledge. Against this background, technologies such as machine learning and deep learning have developed rapidly and achieved great success in big-data applications, and the model algorithms underlying them depend on training with large amounts of labeled corpus data. Most current keyword extraction algorithms use the statistical information of words to judge their importance and select the words that exceed a certain threshold as the keywords of an article. Statistical methods, however, are computationally expensive and require a large statistical corpus. Several keyword measure functions have been proposed on this basis, including TF-IDF, entropy functions, and distribution coefficients. Many machine learning algorithms have also been applied to keyword extraction. CHI and TF-IDF, for example, can each serve as a feature selection method or as a weight calculation method; the difference is that TF-IDF can be applied to any text collection, whereas CHI can be computed only when the texts carry classification labels. TextRank was proposed as a keyword extraction method and was later also tried as a weight calculation method, but its computational complexity is very high.
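The TF-IDF weighting mentioned above can be illustrated with a minimal Python sketch. It assumes the documents are already tokenized and uses the common variant tf × ln(N / df); the function name `tfidf_keywords` and the sample documents are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of one tokenized document by TF-IDF (illustrative only)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Sort by descending score, then alphabetically so ties are stable.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical, pre-tokenized sample documents.
docs = [
    ["keyword", "extraction", "corpus", "keyword"],
    ["corpus", "labeling", "tool"],
    ["corpus", "search", "engine"],
]
print(tfidf_keywords(docs, 0, top_k=2))
```

A word such as "corpus" that occurs in every document receives a weight of zero, which matches the background's remark that words frequent in all articles are less important than words frequent in one article.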
Labeling massive corpus data has an important influence on the training of algorithm models. As basic work in big-data analysis, it mainly supports day-to-day big-data research and development, algorithm tuning, and demonstration and verification, and it is a key foundation of big-data mining and analysis. Keywords are words that carry the essential meaning of a text's main content; they are words or phrases selected from the title, abstract, and body text to meet the needs of indexing or retrieving the text. Keyword extraction is the process of selecting, from a single text or a set of texts and through statistical and semantic analysis of core words, a suitable set of feature terms that fully expresses the subject matter. Because keywords are the most basic units that express the subject of a text, keyword extraction usually has to be performed first in natural language processing and Chinese information processing tasks such as automatic summarization, information retrieval, text clustering, automatic question answering, and topic tracking, and it also provides important clues for information monitoring and tracking.
Part of speech is obtained through word segmentation and syntactic analysis. Most existing keywords are nouns or gerunds. The position at which a word occurs is also generally of great value. The title and abstract, for example, are the author's own summary of the central idea of an article, so words that appear there are representative to some degree and are more likely to be keywords. Because authors differ in habits and writing styles, however, the position of key sentences also varies, so position is only a very rough way of obtaining keywords and is not normally used alone. An obvious measure of whether a word is important in an article is term frequency: important words tend to appear many times. On the other hand, a frequent word is not necessarily important, because some words appear frequently in all kinds of articles and are certainly less important than words that appear frequently only in a particular article. Term frequency is the frequency with which a word occurs in a text. It is generally held that the more frequently a word occurs in a text, the more likely it is to be a core word of the article. Term frequency simply counts the occurrences of a word in the text, but keywords obtained from term frequency alone are highly uncertain, and for long texts this method is very noisy. Keyword extraction based on statistical features centers on computing quantitative feature indicators, and different indicators yield different results. Each indicator also has its own advantages and disadvantages, so in practice different indicators are usually combined to obtain the top-k words as keywords.
Keyword extraction is widely used in text mining, but existing methods have certain problems. Prior-art methods are based on statistics and machine learning, and the effectiveness of machine learning methods depends heavily on manually labeled corpora: model parameters are trained on observed data (the labeled corpus), and at the word segmentation stage the model computes the probability of each candidate segmentation and takes the most probable one as the final result. Machine learning methods can be applied only if a sufficiently large knowledge base or training set is established. Because the problem of knowledge learning has not yet been fundamentally solved, knowledge bases are updated very slowly and cannot keep pace with scientific development. Labeled data sets provide limited information, and labeling samples manually is time-consuming and laborious, so large-scale labeling is too costly. Easily obtained unlabeled samples (such as web pages on the Internet) far outnumber labeled samples and are also closer to the data distribution over the whole sample space. Providing as many labeled samples as possible requires arduous and slow manual labor and holds back construction of the whole system, which creates a labeling bottleneck. The idea of keyword extraction algorithms based on statistical features is to use the statistical information of the words in a document to extract its keywords. A text is usually preprocessed to obtain a set of candidate words, and keywords are then obtained from the candidate set by quantifying feature values. Common sequence labeling models include HMM and CRF. Segmentation algorithms of this kind handle ambiguity and out-of-vocabulary words well and outperform the preceding kind, but they require large amounts of manually labeled data and segment more slowly. Supervised keyword extraction treats keyword extraction as a binary classification problem that judges whether each word or phrase in a document is a keyword. Because it is a classification problem, a labeled training corpus must be provided; a keyword extraction model is trained on the training corpus and then used to extract keywords from the documents that need them.
Traditional keyword extraction methods fall into two categories: unsupervised methods and supervised methods. Unsupervised methods include TF-IDF, chi-squared, TextRank, and LDA, whereas supervised methods turn keyword extraction into a binary classification problem that judges whether each word is a keyword; supervised methods such as Naive Bayes and the C4.5 decision tree have previously been used for keyword extraction. Each category has advantages and disadvantages. Unsupervised methods need no manually labeled training set and are therefore faster, but because they cannot combine multiple kinds of information to rank candidate words, they may not perform as well as supervised methods. Supervised methods can learn through training how much each kind of information influences the keyword decision and therefore perform better, but in today's data-rich era labeling a training set is very time-consuming and labor-intensive. The shortcoming of supervised text keyword extraction algorithms is their high labor cost. A third category recognizes words by having the computer simulate human understanding of sentences. Because of the complexity of Chinese semantics, it is difficult to organize the various kinds of linguistic information into a machine-readable form, and because a large training corpus must be labeled, the manual approach is time-consuming and laborious; word segmentation systems of this kind are still experimental.
When a linguistic network graph is built, the preprocessed words serve as nodes and the relations between words serve as edges. The weight of an edge generally represents the degree of association between the two words it connects. To obtain keywords from a linguistic network graph, the importance of each node is evaluated, the nodes are ranked by importance, and the words represented by the top-K nodes are selected as keywords. Because Chinese has no explicit word boundaries, automatic indexing of keyword strings is made more difficult, and a keyword extraction model needs more training data to achieve stable results. In practice, because application environments are complex, the same keyword extraction method does not perform equally well on different types of text, such as long and short texts. The algorithm used therefore differs with the conditions, and no single algorithm performs well in every environment. In engineering practice, results also depend heavily on text preprocessing and on the accuracy of word segmentation. Misspelled words, variant words, and similar information in the text must be resolved during preprocessing, and the choice of segmentation algorithm and the recognition of out-of-vocabulary and ambiguous words also strongly affect keyword extraction. Keyword extraction looks simple but is a very difficult task in practice.
Because a tagged corpus of Chinese opinion-type subjective text contains a large amount of information, such as word segmentation, part of speech, dependency relations, semantics, word concepts, and opinions, completing the annotation is usually complicated. To lighten the burden on annotators, improve labeling efficiency and accuracy, and reduce the labeling error rate, an automatic labeling tool for keyword corpora must be developed to assist annotators. Keyword corpora are currently scarce in the field, and keyword corpus labeling is done mainly by hand, with widespread problems such as poor labeling quality, a cumbersome labeling process, low efficiency, and high labor cost. Existing keyword corpus labeling tools also suffer from drawbacks such as a single labeling method and difficulty in automatically updating the labeling model. There is therefore an urgent need for a semi-automatic keyword labeling and training platform that assists manual corpus labeling and solves the above problems.
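The linguistic network graph approach described in the background (words as nodes, associations as weighted edges, top-K nodes as keywords) can be sketched roughly as follows. This is a simplified TextRank-style ranking over a co-occurrence graph. The window size, damping factor, iteration count, and the function name `textrank_keywords` are illustrative assumptions, not values given in the patent.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with a simplified TextRank over a co-occurrence graph."""
    # Undirected graph: the edge weight counts co-occurrences within the window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for word in weights:
            # Each neighbor passes on a share of its score proportional
            # to the weight of the connecting edge.
            rank = sum(
                weights[nbr][word] / sum(weights[nbr].values()) * scores[nbr]
                for nbr in weights[word]
            )
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    # Highest score first; alphabetical order breaks ties.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical token sequence, already segmented and filtered.
tokens = ["corpus", "keyword", "model", "keyword", "training", "keyword", "tool"]
print(textrank_keywords(tokens, top_k=1))
```

The repeated score propagation over every edge is also where the computational cost noted earlier for TextRank comes from.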
Summary of the invention
The object of the present invention is to address the drawbacks of the above keyword corpus labeling process and of training with the corpus, and to provide a semi-automatic keyword corpus labeling and training tool that reduces the complexity of manual annotation, lowers manual labor cost, and improves the efficiency and accuracy of labeling massive keyword corpora.
The above object of the present invention is achieved by the following technical scheme: a keyword corpus labeling, training, and extraction tool comprising a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, characterized in that: the keyword corpus labeling preparation module distinguishes massive corpus data from different sources and, for keyword corpora with different purposes, selects the keyword corpus source and designates it as the corpus to be labeled for each purpose, that is, the raw corpus; the semi-automatic corpus keyword labeling module first creates a keyword labeling task and then, according to the labeling requirements and corpus characteristics, independently selects a suitable algorithm and performs automatic labeling based on the algorithm model, integrating at least one of the keyword extraction algorithms CHI, LDA, the graph-ranking-based keyword extraction algorithm TextRank, and TF-IDF to pre-label the text corpus data to be labeled with a single keyword method; it can also pre-label the text corpus data with a single keyword method through automatic labeling based on business rules, or select several keyword extraction algorithms at once for keyword labeling and fuse their labeling results; the fused labeling results are further reviewed and adjudicated manually according to the keyword labeling business standard and saved as the labeled corpus, which is managed by the keyword corpus labeling preparation module for use in training labeling algorithm models, and a unified keyword model access standard is provided to complete the corpus keyword labeling work; after a labeling task is completed, the feedback-type keyword labeling model learning and training module provides learning and training for the internally integrated keyword labeling algorithm models and for external deep enhanced labeling algorithm models by setting the keyword algorithm model parameters, retrains the keyword labeling algorithm models on the labeled keyword corpus, feeds back model improvements and updates, and, through continuous iteration between model updating and corpus labeling, automatically adjusts by feedback to complete new keyword labeling tasks; and the keyword labeling model effect evaluation module constructs keyword evaluation measures according to the keyword evaluation index standard, quantifies the measures based on keyword index rules, establishes a comprehensive evaluation model for the labeling algorithms, automatically evaluates the quantified labeling effect of each model, and automatically recommends the optimal labeling model for subsequent keyword labeling tasks.
Compared with the prior art, the present invention has the following beneficial effects:
It reduces the complexity of manual annotation and lowers manual labor cost. The system consists of four modules: keyword corpus labeling preparation, semi-automatic corpus keyword labeling, feedback-type keyword labeling model learning and training, and keyword labeling model effect evaluation. For different labeling requirements and corpus characteristics, it provides automatic labeling modes based on independently selected suitable algorithms and on multi-algorithm fusion. Multi-algorithm fusion combines the results of several algorithms by voting; when the correlation between methods is ignored, the ensemble performs better than any single method. Pre-labeling done this way reduces the complexity of manual annotation and lowers manual labor cost, and it offers a degree of flexibility and a high level of automation.
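The voting-based fusion of multiple algorithm results can be sketched as below. The patent states only that a voting method is used, so the majority threshold `min_votes`, the tie-breaking order, and the function name `fuse_by_voting` are assumptions made for illustration.

```python
from collections import Counter


def fuse_by_voting(results, min_votes=2):
    """Fuse keyword lists from several algorithms by simple voting.

    results maps an algorithm name to the list of keywords it proposed.
    A keyword is kept when at least min_votes algorithms proposed it.
    """
    # Each algorithm casts at most one vote per keyword.
    votes = Counter(kw for kws in results.values() for kw in set(kws))
    fused = [kw for kw, n in votes.items() if n >= min_votes]
    # Most-voted keywords first; alphabetical order breaks ties.
    return sorted(fused, key=lambda kw: (-votes[kw], kw))


# Hypothetical pre-labeling output of three integrated algorithms.
results = {
    "TFIDF": ["corpus", "label", "model"],
    "TEXTRANK": ["corpus", "model", "tool"],
    "LDA": ["corpus", "topic"],
}
print(fuse_by_voting(results))  # ['corpus', 'model']
```

In the workflow described here, the fused list would then go to the manual adjudication step rather than being accepted as final.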
Keyword corpus labeling is efficient. The invention manages the keyword corpus by distinguishing data from different sources. It supports real-time integration of keyword extraction algorithms such as CHI, LDA, TextRank, and TF-IDF in the back end and, for different keyword corpora, offers a selectable library of trained keyword extraction models for the applicable labeling algorithms (CHI, LDA, TextRank, TF-IDF, and so on) during labeling. The corpus data to be labeled is pre-labeled with a single keyword method or with a fusion of several keyword methods, and a manual adjudication step is introduced. The system supports real-time automatic feedback adjustment of the back-end keyword algorithm models to complete new keyword labeling tasks, which greatly shortens the time needed to obtain information, improves the efficiency of information acquisition, and greatly improves corpus labeling efficiency.
For different labeling requirements and corpus characteristics, the invention independently selects a suitable algorithm and performs automatic labeling. By integrating at least one of the CHI, LDA, TextRank, and TF-IDF keyword extraction algorithms, it pre-labels the text corpus data to be labeled with a single keyword method or with a fusion of several keyword methods, and it provides a unified keyword model access standard to complete corpus keyword labeling. After a labeling task is completed, the keyword model is retrained on the labeled corpus. The labeling effect of the model is evaluated through a comprehensive evaluation model for the labeling algorithms, and the result is fed back into keyword model learning and training so that the model reaches its best performance and the accuracy of the keyword labeling model improves. For subsequent new labeling tasks, continuous iteration between model updating and corpus labeling improves both the quality of corpus keyword labeling and the performance of the algorithm models and reduces the keyword labeling error rate. Finally, labeling results are adjudicated in a manual review step, in which the labeled keyword corpus is modified, confirmed, and submitted to complete the corpus keyword labeling work, which greatly improves the accuracy and precision of keyword extraction. Experiments demonstrate that the keyword labeling, training, and extraction tool is effective for labeling keyword corpora.
The invention simplifies the user's labeling workflow, provides a friendly human-computer interactive labeling interface, and supports importing, training, and using external models.
Brief description of the drawings
Fig. 1 is a schematic diagram of the keyword corpus labeling, training, and extraction tool of the present invention.
Fig. 2 is a flowchart of the keyword model training process of Fig. 1.
To make the objects, technical schemes, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings.
Specific embodiment
Refer to Fig. 1. In the preferred embodiment described below, a keyword corpus labeling, training, and extraction tool comprises a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, wherein: the keyword corpus labeling preparation module distinguishes massive corpus data from different sources and, for keyword corpora with different purposes, selects the keyword corpus source and designates it as the corpus to be labeled for each purpose, that is, the raw corpus; the semi-automatic corpus keyword labeling module first creates a keyword labeling task and then, according to the labeling requirements and corpus characteristics, independently selects a suitable algorithm and performs automatic labeling based on the algorithm model, integrating at least one of the keyword extraction algorithms CHI, LDA, the graph-ranking-based keyword extraction algorithm TextRank, and TF-IDF to pre-label the text corpus data to be labeled with a single keyword method; it can also pre-label the text corpus data with a single keyword method through automatic labeling based on business rules, or select several keyword extraction algorithms at once for keyword labeling and fuse their labeling results; the fused labeling results are further reviewed and adjudicated manually according to the keyword labeling business standard and saved as the labeled corpus, which is managed by the keyword corpus labeling preparation module for use in training labeling algorithm models, and a unified keyword model access standard is provided to complete the corpus keyword labeling work; after a labeling task is completed, the feedback-type keyword labeling model learning and training module provides learning and training for the internally integrated keyword labeling algorithm models and for external deep enhanced labeling algorithm models by setting the keyword algorithm model parameters, retrains the keyword labeling algorithm models on the labeled keyword corpus, feeds back model improvements and updates, and, through continuous iteration between model updating and corpus labeling, automatically adjusts by feedback to complete new keyword labeling tasks; and the keyword labeling model effect evaluation module constructs keyword evaluation measures according to the evaluation index standard for keywords, quantifies the measures based on keyword index rules, establishes a comprehensive evaluation model for the labeling algorithms, automatically evaluates the quantified labeling effect of each model, and automatically recommends the optimal labeling model for subsequent keyword labeling tasks.
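The evaluation and recommendation step can be illustrated with a short sketch. The patent does not name the evaluation measures, so precision, recall, and F1 against a manually confirmed keyword set are assumed here purely for illustration, and the names `evaluate` and `recommend_model` are hypothetical.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F1) of predicted keywords against gold ones."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def recommend_model(outputs, gold):
    """Pick the model whose output has the highest F1 (an assumed criterion)."""
    return max(outputs, key=lambda name: evaluate(outputs[name], gold)[2])


# Hypothetical manually confirmed keywords and two models' automatic labels.
gold = ["corpus", "keyword", "model"]
outputs = {
    "TFIDF": ["corpus", "keyword", "tool"],
    "TEXTRANK": ["corpus", "keyword", "model", "tool"],
}
print(recommend_model(outputs, gold))  # TEXTRANK
```

A comprehensive evaluation model as described in the text would presumably combine several such quantified measures; a single F1 score is used only to keep the example small.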
This embodiment provides a text corpus labeling preparation module that manages the corpus to be labeled by source or by topic and prepares for labeling tasks. The semi-automatic corpus keyword labeling module independently selects a suitable algorithm for different labeling requirements and corpus characteristics, performs automatic labeling, and has labeling results adjudicated in a manual review step. The specific steps are as follows:
Semi-automatic corpus keyword labeling module creates keyword according to separate sources corpus and marks task;For each category
Infuse task choosing effect adaptation algorithm model, such as can choose in keyword mark task CHI, LDA, TEXTRANK,
TFIDF and other keyword extraction algorithms complete the automatic labeling. The specific labeling algorithm can be configured according to how well automatic labeling performs on the corpus, and the semi-automatic corpus keyword labeling module can automatically recommend a default labeling algorithm based on the results of the keyword labeling model effect evaluation module. The semi-automatic corpus keyword labeling module first creates a keyword labeling task; then, according to the labeling requirements and the corpus characteristics, it selects a suitable algorithm and performs automatic labeling based on the algorithm model. By integrating at least one of the keyword extraction algorithms CHI, LDA, the graph-ranking-based keyword extraction algorithm TEXTRANK, and TFIDF, it performs single-keyword pre-labeling on the text corpus data to be labeled; it can also perform single-keyword pre-labeling on that data through automatic labeling based on business rules. The semi-automatic corpus keyword labeling module creates business labeling rules for special labeling tasks and manages those rules. The business labeling rules mainly comprise business dictionaries and regular expressions for matching character strings, for example key dates and times or key geographic locations. A regular expression is defined directly as a variable such as reg (dim reg as expreg); after Microsoft Scripting Runtime is selected, a dictionary object is likewise defined directly as a variable (dim d as dictionary). The basic matching process of a regular expression is to take characters from the expression and from the text in turn and compare them: if every character matches, the match succeeds; as soon as one character fails to match, the match fails. Annotators use the business labeling rules to label the corpus automatically. The automatic labeling results based on the algorithm model and those based on business rules are fused; several keyword extraction algorithms can also be selected at the same time for keyword labeling, and their labeling results fused. The fused labeling results are then reviewed manually according to the keyword labeling business standards (intervention, verification, modification, confirmation and saving) and saved as labeled corpus. The labeled corpus is managed by the keyword corpus labeling preparation module for use in training labeling algorithm models, and a unified keyword model access standard is provided to complete the corpus keyword labeling work.
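The rule-based pre-labeling and the fusion step described above can be sketched as follows. This is a minimal illustration only: the dictionary entries, the date pattern, the function names and the vote-counting fusion strategy are all assumptions, since the text does not specify how results are fused.

```python
import re
from collections import Counter

# Hypothetical business rules: a small business dictionary and one regular expression.
BUSINESS_DICTIONARY = {"radar", "satellite"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed date format


def label_by_rules(text):
    """Return the keywords found by dictionary lookup and regex matching."""
    found = {term for term in BUSINESS_DICTIONARY if term in text}
    found.update(DATE_PATTERN.findall(text))
    return found


def fuse(results, min_votes=1):
    """Merge keyword sets from several labelers, keeping terms with enough votes."""
    votes = Counter(term for keywords in results for term in set(keywords))
    return sorted(term for term, count in votes.items() if count >= min_votes)


text = "The satellite was launched on 2019-05-29."
algorithm_result = {"satellite", "launched"}  # stand-in for TFIDF/TEXTRANK output
rule_result = label_by_rules(text)
candidates = fuse([algorithm_result, rule_result])
```

With `min_votes=1` the fusion is a plain union, which leaves the filtering to the manual review step; raising `min_votes` keeps only keywords on which several labelers agree.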
Refer to Fig. 2. The feedback-type keyword labeling model learning and training module provides model learning and training, through keyword algorithm model parameter settings, for the internally integrated keyword labeling algorithm models and for external deep enhanced labeling algorithm models. In the keyword model training flow, the module reads the labeled corpus used for training and selects a keyword algorithm to train. If the algorithm is not trainable, there is no training process and the flow ends. Otherwise, the labeled corpus data is used to train the trainable algorithms (CHI, LDA, TEXTRANK, TFIDF and so on) offline: the unified training model interface Train is called and the keyword model serialization file Kryo is generated, so that model accuracy is optimized. After the keyword model serialization file Kryo is generated, the module decides whether to save the keyword model. If not, the flow ends. If so, external algorithm models are imported through the unified model access interface and updated or exported; the keyword model file, which comprises the algorithm name, the model name and the serialized model file, is saved; and the keyword training model table is updated. The trained model is then used to update the keyword labeling model on the platform, and new keyword labeling tasks are carried out. In the keyword model update flow, the feedback-type model learning and training module starts the keyword service and selects the keyword algorithm to be updated. If the selected keyword algorithm is not trainable, the flow ends. For a selected trainable algorithm (CHI, LDA, TEXTRANK, TFIDF and so on), the module parses the keyword update switch in the configuration file to decide whether to update the keyword model. If not, the flow ends. If so, it reads the designated keyword model file according to the keyword model name and the keyword training model table, deserializes the file to complete loading of the keyword model, and the program ends.
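The train, save and load flows above can be sketched as below. This is an assumption-laden illustration, not the patented implementation: the text names a Train interface and Kryo serialization, whereas this sketch uses Python's `pickle` as a stand-in serializer, a toy "model" that only stores a vocabulary, and hypothetical function and table names.

```python
import pickle
import tempfile
from pathlib import Path

TRAINABLE = {"CHI", "LDA", "TEXTRANK", "TFIDF"}  # algorithms named in the text


def train(algorithm, labeled_corpus):
    """Hypothetical stand-in for the unified Train interface."""
    if algorithm not in TRAINABLE:
        return None  # non-trainable algorithm: no training, the flow ends
    vocabulary = sorted({kw for _, keywords in labeled_corpus for kw in keywords})
    return {"algorithm": algorithm, "vocabulary": vocabulary}


def save_model(model, name, directory, model_table):
    """Serialize the model and record it in the keyword training model table."""
    path = Path(directory) / f"{name}.bin"
    path.write_bytes(pickle.dumps(model))
    model_table[name] = {"algorithm": model["algorithm"], "file": str(path)}


def load_model(name, model_table, update_enabled):
    """Deserialize the named model only if the update switch is on."""
    if not update_enabled or name not in model_table:
        return None
    return pickle.loads(Path(model_table[name]["file"]).read_bytes())


corpus = [("doc one", ["radar"]), ("doc two", ["satellite", "radar"])]
table = {}
with tempfile.TemporaryDirectory() as tmp:
    model = train("TFIDF", corpus)
    save_model(model, "tfidf_v1", tmp, table)
    reloaded = load_model("tfidf_v1", table, update_enabled=True)
```

Keeping the model table separate from the serialized files mirrors the description, where the model name and the table together locate the file to deserialize.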
The labeling model effect evaluation module provides methods for building model evaluation index standards, building rules and quantifying indicators, and supports automatic assessment of the model labeling effect by constructing a comprehensive labeling algorithm evaluation model. The specific steps are as follows: the labeling model effect evaluation module sets the single-indicator algorithms according to the evaluation criteria; it quantifies the indicators according to the indicator calculation rules; for each labeling task it organizes the corresponding indicators into a comprehensive labeling algorithm evaluation model; and it computes the comprehensive indicator value and feeds back the labeling model effect.
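One possible reading of the comprehensive evaluation step is a weighted average of the quantified indicators, with the best-scoring model recommended for later tasks. The weighted average, the indicator names and the example values below are assumptions made for illustration; the text does not state how the indicators are combined.

```python
def composite_score(indicators, weights):
    """Weighted average of quantified indicators (weights are task-specific)."""
    total = sum(weights.values())
    return sum(indicators[name] * w for name, w in weights.items()) / total


def recommend(models, weights):
    """Return the name of the model with the highest composite score."""
    return max(models, key=lambda name: composite_score(models[name], weights))


# Hypothetical indicator values for two labeling algorithms.
models = {
    "TFIDF": {"precision": 0.5, "recall": 0.75},
    "TEXTRANK": {"precision": 0.75, "recall": 0.25},
}
weights = {"precision": 1.0, "recall": 1.0}
best = recommend(models, weights)
```

Changing the weights per labeling task changes the recommendation: with these example values, weighting precision three times as heavily as recall would favor TEXTRANK instead.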
There is not yet a unified method, at home or abroad, for evaluating the quality of keyword extraction. Because the selection of keywords from text data is highly subjective, the present invention evaluates keyword extraction quality in two ways: machine quantitative analysis and human subjective judgment. The most common indicators for machine quantitative analysis remain the precision P (Precision), the recall R (Recall), the harmonic mean F of keyword extraction precision and recall, and the value E, which weights keyword extraction precision and recall according to application requirements, where

$$P = \frac{\text{number of correctly extracted keywords}}{\text{number of extracted keywords}}, \qquad R = \frac{\text{number of correctly extracted keywords}}{\text{number of reference keywords}}, \qquad F = \frac{2PR}{P + R}.$$

Precision and recall are generally inversely related: raising precision by some method lowers recall, and vice versa. To express an application system's different requirements for precision and recall, a weight can be applied to them, yielding the value E that weights keyword extraction precision and recall, where b is the added weight: the larger b is, the greater the weight of precision in E; the smaller b is, the greater the weight of recall.
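The indicators can be computed as in the sketch below. Precision, recall and F follow their standard definitions. The formula for E did not survive in this text, so `e_value` uses one assumed form, E = 1 - (1 + b²) / (b²/P + 1/R), chosen only because it matches the stated behavior that a larger b gives precision more weight; under this form a lower E is better. Function names are hypothetical.

```python
def precision_recall(extracted, reference):
    """Standard precision and recall over sets of keywords."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(reference) if reference else 0.0
    return p, r


def f_value(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def e_value(p, r, b):
    """Assumed form: E = 1 - (1 + b^2) / (b^2 / P + 1 / R)."""
    if p == 0 or r == 0:
        return 1.0
    return 1 - (1 + b * b) / (b * b / p + 1 / r)


p, r = precision_recall(["a", "b", "x", "y"], ["a", "b"])
f = f_value(p, r)
e_balanced = e_value(p, r, 1)
```

With b = 1 this assumed E equals 1 - F; as b grows it approaches 1 - P, and as b shrinks toward 0 it approaches 1 - R.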
In addition, there are two other common indicators: the binary preference measure (Bpref) and the mean reciprocal rank (MRR), an indicator used to evaluate search algorithms. Bpref is an evaluation metric that takes the ranking order into account. For a document, if R of the M extracted keywords are standard answers, with r denoting a correctly extracted keyword and n denoting an incorrectly extracted keyword, then Bpref is calculated by the following formula:

$$\mathrm{Bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{|n \text{ ranked higher than } r|}{M} \right)$$
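A minimal sketch of Bpref for a single document, following the definitions of M, R, r and n given above. The function name and the choice to return 0 when no extracted keyword is correct are assumptions.

```python
def bpref(ranked, reference):
    """Bpref for one document; `ranked` is the extracted list, best first."""
    reference = set(reference)
    m = len(ranked)  # M: number of extracted keywords
    correct_total = sum(1 for kw in ranked if kw in reference)  # R
    if correct_total == 0:
        return 0.0  # assumed convention when nothing correct was extracted
    wrong_seen = 0  # incorrect keywords (n) ranked above the current position
    score = 0.0
    for kw in ranked:
        if kw in reference:
            score += 1 - wrong_seen / m
        else:
            wrong_seen += 1
    return score / correct_total


mixed = bpref(["a", "x", "b", "y"], ["a", "b"])
perfect = bpref(["a", "b", "x", "y"], ["a", "b"])
```

In the first call the keyword "b" has one incorrect keyword ranked above it, so it contributes 1 - 1/4 and the result is (1 + 0.75) / 2.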
The search-algorithm evaluation indicator MRR measures the rank of the first correctly recommended keyword of each document; it is an evaluation indicator defined over a document set. For a document d, let rank_d denote the rank position of the first correctly recommended keyword; MRR is then defined as:

$$\mathrm{MRR} = \frac{1}{|D|} \sum_{d \in D} \frac{1}{rank_d}$$

where D is the set of documents used for the keyword extraction test.
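The MRR definition can be sketched as follows. The function name is hypothetical, and treating a document with no correct keyword as contributing 0 is an assumption that the text does not address.

```python
def mean_reciprocal_rank(rankings, references):
    """Average of 1 / rank of the first correct keyword over all documents."""
    total = 0.0
    for ranked, reference in zip(rankings, references):
        reference = set(reference)
        for position, kw in enumerate(ranked, start=1):
            if kw in reference:
                total += 1 / position
                break  # only the first correct keyword counts
    return total / len(rankings) if rankings else 0.0


rankings = [["a", "b"], ["x", "c"]]
references = [["a"], ["c"]]
mrr = mean_reciprocal_rank(rankings, references)
```

Here the first document has its first correct keyword at rank 1 and the second at rank 2, so the result is (1 + 1/2) / 2.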
The corpus to be labeled is managed by source or topic, which prepares it for labeling tasks. The integrated keyword extraction algorithms such as CHI, LDA, TEXTRANK and TFIDF complete the semi-automatic labeling of the keyword corpus: during labeling, a suitable labeling algorithm can be selected to pre-label keywords in the corpus data to be labeled, and finally, in the manual confirmation step, the labeled corpus is modified, confirmed and submitted, which completes the corpus labeling work. After a labeling task is completed, the labeled corpus is used to retrain the model. The model labeling effect is assessed by establishing a comprehensive labeling algorithm evaluation model, and feedback-based model learning and training brings the model to its best effect for use in subsequent new labeling tasks. Continuous iteration between model updating and corpus labeling improves both the corpus labeling quality and the algorithm model effect.
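The automatic pre-labeling step that opens this cycle can be illustrated with TFIDF, one of the integrated algorithms. The sketch below is a plain textbook TF-IDF ranking over already tokenized documents, not the tool's actual implementation; the function name, the `top_k` parameter and the alphabetical tie-breaking are assumptions.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=2):
    """Rank each tokenized document's terms by TF-IDF and keep the top_k."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    results = []
    for doc in documents:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


docs = [
    ["radar", "signal", "radar", "noise"],
    ["satellite", "signal", "orbit", "orbit"],
]
pre_labels = tfidf_keywords(docs)
```

The term "signal" appears in both documents, so its inverse document frequency is zero and it is never proposed; the proposed keywords would then go to manual confirmation.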
The above is a preferred embodiment of the present invention. It should be noted that the above embodiment illustrates the present invention rather than limiting it, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. For those of ordinary skill in the art, various changes and modifications can be made without departing from the spirit and substance of the present invention, and these changes and modifications are also considered to fall within the protection scope of the present invention.
Claims (10)
1. A keyword corpus labeling training extraction tool, comprising: a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, characterized in that: the keyword corpus labeling preparation module distinguishes massive corpus data from different sources, selects keyword corpus sources for keyword corpora of different purposes, and sets them as the corpus to be labeled for those purposes, i.e. the raw corpus; the semi-automatic corpus keyword labeling module first creates a keyword labeling task, selects a suitable algorithm according to the labeling requirements and the corpus characteristics, and performs automatic labeling based on the algorithm model, performing single-keyword pre-labeling on the text corpus data to be labeled by integrating at least one of the keyword extraction algorithms CHI, LDA, the graph-ranking-based keyword extraction algorithm TEXTRANK, and TFIDF, or selecting several of these keyword extraction algorithms at the same time for keyword labeling and fusing their labeling results; after the labeling task is completed, the feedback-type keyword labeling model learning and training module provides algorithm model learning and training, through keyword algorithm model parameter settings, for the internally integrated keyword labeling algorithm models and for external deep enhanced labeling algorithm models, retrains the keyword labeling algorithm model using the labeled keyword corpus, feeds back model improvements and updates, and, through continuous iteration between model updating and corpus labeling with automatic feedback adjustment, completes new keyword labeling tasks; the keyword labeling model effect evaluation module builds keyword evaluation indicators according to the keyword evaluation index standards, quantifies the evaluation indicators based on the keyword indicator rules, establishes a comprehensive labeling algorithm evaluation model, automatically assesses the labeling effect from the quantified model indicators, and automatically recommends the optimal labeling model for subsequent keyword labeling tasks.
2. The keyword corpus labeling training extraction tool according to claim 1, characterized in that: the fused labeling results are further reviewed and verified manually according to the keyword labeling business standards and saved as labeled corpus; the labeled corpus is managed by the keyword corpus labeling preparation module for use in training labeling algorithm models, and a unified keyword model access standard is provided to complete the corpus keyword labeling work.
3. The keyword corpus labeling training extraction tool according to claim 1, characterized in that: the semi-automatic corpus keyword labeling module creates business labeling rules for special labeling tasks and manages those rules, the business labeling rules mainly comprising business dictionaries and regular expressions for matching character strings.
4. The keyword corpus labeling training extraction tool according to claim 1, characterized in that: the feedback-type keyword labeling model learning and training module reads the labeled corpus used for training and selects a keyword algorithm to train; for a non-trainable algorithm there is no training process and the flow ends; the labeled corpus data is used to train the trainable algorithms CHI, LDA, TEXTRANK and TFIDF offline, the unified training model interface Train is called, and the keyword model serialization file Kryo is generated, so that model accuracy is optimized.
5. The keyword corpus labeling training extraction tool according to claim 3, characterized in that: after the keyword model serialization file Kryo is generated, the feedback-type keyword labeling model learning and training module decides whether to save the keyword model; if not, the flow ends; if so, external algorithm models are imported through the unified model access interface and updated or exported, the keyword model file comprising the algorithm name, the model name and the serialized model file is saved, and the keyword training model table is updated; the trained model is used to update the keyword labeling model on the platform, and new keyword labeling tasks are carried out.
6. The keyword corpus labeling training extraction tool according to claim 4, characterized in that: in the keyword model update, the feedback-type model learning and training module starts the keyword service and selects the keyword algorithm to be updated; if the selected keyword algorithm is not trainable, the flow ends; for a selected trainable algorithm among CHI, LDA, TEXTRANK and TFIDF, the module parses the keyword update switch in the configuration file to decide whether to update the keyword model; if not, the flow ends; if so, it reads the designated keyword model file according to the keyword model name and the keyword training model table, deserializes the file to complete loading of the keyword model, and the program ends.
7. The keyword corpus labeling training extraction tool according to claim 1, characterized in that: the labeling model effect evaluation module sets the single-indicator algorithms according to the evaluation criteria; quantifies the indicators according to the indicator calculation rules; organizes the corresponding indicators for each labeling task into a comprehensive labeling algorithm evaluation model; and computes the comprehensive indicator value and feeds back the labeling model effect.
8. The keyword corpus labeling training extraction tool according to claim 1, characterized in that: machine quantitative analysis and human subjective judgment are both used to evaluate the quality of keyword extraction.
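9. The keyword corpus labeling training extraction tool according to claim 7, characterized in that: the indicators of machine quantitative analysis are the precision P (Precision), the recall R (Recall), the F value and the E value, wherein:
precision $P = \frac{\text{number of correctly extracted keywords}}{\text{number of extracted keywords}}$;
recall $R = \frac{\text{number of correctly extracted keywords}}{\text{number of reference keywords}}$;
harmonic mean of keyword extraction precision and recall $F = \frac{2PR}{P + R}$.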
10. The keyword corpus labeling training extraction tool according to claim 1, characterized in that: precision and recall are generally inversely related, so that raising precision by some method lowers recall; to express an application system's different requirements for precision P and recall R, a weight is applied to precision P and recall R, yielding the weighted value E, where b is the added weight: the larger b is, the greater the weight of precision in E; the smaller b is, the greater the weight of recall.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910455064.3A CN110298033B (en) | 2019-05-29 | 2019-05-29 | Keyword corpus labeling training extraction system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910455064.3A CN110298033B (en) | 2019-05-29 | 2019-05-29 | Keyword corpus labeling training extraction system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110298033A true CN110298033A (en) | 2019-10-01 |
CN110298033B CN110298033B (en) | 2022-07-08 |
Family
ID=68027297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910455064.3A Active CN110298033B (en) | 2019-05-29 | 2019-05-29 | Keyword corpus labeling training extraction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110298033B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106997344A (en) * | 2017-03-31 | 2017-08-01 | 成都数联铭品科技有限公司 | Keyword abstraction system |
CN108197098A (en) * | 2017-11-22 | 2018-06-22 | 阿里巴巴集团控股有限公司 | A kind of generation of keyword combined strategy and keyword expansion method, apparatus and equipment |
CN108268447A (en) * | 2018-01-22 | 2018-07-10 | 河海大学 | A kind of mask method of Tibetan language name entity |
US20180196870A1 (en) * | 2017-01-12 | 2018-07-12 | Microsoft Technology Licensing, Llc | Systems and methods for a smart search of an electronic document |
CN108595460A (en) * | 2018-01-05 | 2018-09-28 | 中译语通科技股份有限公司 | Multichannel evaluating method and system, the computer program of keyword Automatic |
CN108763213A (en) * | 2018-05-25 | 2018-11-06 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Theme feature text key word extracting method |
CN108960338A (en) * | 2018-07-18 | 2018-12-07 | 苏州科技大学 | The automatic sentence mask method of image based on attention-feedback mechanism |
CN109710728A (en) * | 2018-11-26 | 2019-05-03 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | News topic automatic discovering method |
Non-Patent Citations (5)
Title |
---|
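HANGFENG HE et al.: "A Unified Model for Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media, Jinseok Nam Semi-Supervised Neural Networks for Nested Named Entity Recognition", 《AAAI》 * |
MATTHEW E. PETERS et al.: "Semi-supervised sequence tagging with bidirectional language models", 《ARXIV》 * |
FENG HAOZHE et al.: "Unsupervised recommendation annotation algorithm for 3D CT image processing", 《Journal of Computer-Aided Design & Computer Graphics》 * |
LIU XIAOJUAN et al.: "Research on foreign knowledge extraction ***", 《Information Science》 * |
WANG MIN et al.: "Text-semantic shot segmentation and annotation of instructional videos", 《Journal of Data Acquisition and Processing》 * |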
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110781290A (en) * | 2019-10-10 | 2020-02-11 | 南京摄星智能科技有限公司 | Extraction method of structured text abstract of long chapter |
CN110968695A (en) * | 2019-11-18 | 2020-04-07 | 罗彤 | Intelligent labeling method, device and platform based on active learning of weak supervision technology |
CN111125312A (en) * | 2019-12-24 | 2020-05-08 | 深圳视界信息技术有限公司 | Text labeling method and system |
CN111143577A (en) * | 2019-12-27 | 2020-05-12 | 北京百度网讯科技有限公司 | Data annotation method, device and system |
CN111143577B (en) * | 2019-12-27 | 2023-06-16 | 北京百度网讯科技有限公司 | Data labeling method, device and system |
US11860838B2 (en) | 2019-12-27 | 2024-01-02 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Data labeling method, apparatus and system, and computer-readable storage medium |
CN111476034A (en) * | 2020-04-07 | 2020-07-31 | 同方赛威讯信息技术有限公司 | Legal document information extraction method and system based on combination of rules and models |
CN111859854A (en) * | 2020-06-11 | 2020-10-30 | 第四范式(北京)技术有限公司 | Data annotation method, device and equipment and computer readable storage medium |
CN111859872A (en) * | 2020-07-07 | 2020-10-30 | 中国建设银行股份有限公司 | Text labeling method and device |
CN112269877A (en) * | 2020-10-27 | 2021-01-26 | 维沃移动通信有限公司 | Data labeling method and device |
CN112365159A (en) * | 2020-11-11 | 2021-02-12 | 福建亿榕信息技术有限公司 | Deep neural network-based backup cadre recommendation method and system |
CN112508376A (en) * | 2020-11-30 | 2021-03-16 | 中国科学院深圳先进技术研究院 | Index system construction method |
CN112307175A (en) * | 2020-12-02 | 2021-02-02 | 龙马智芯(珠海横琴)科技有限公司 | Text processing method, text processing device, server and computer readable storage medium |
CN112632284A (en) * | 2020-12-30 | 2021-04-09 | 上海明略人工智能(集团)有限公司 | Information extraction method and system for unlabeled text data set |
CN112395395A (en) * | 2021-01-19 | 2021-02-23 | 平安国际智慧城市科技股份有限公司 | Text keyword extraction method, device, equipment and storage medium |
CN112395395B (en) * | 2021-01-19 | 2021-05-28 | 平安国际智慧城市科技股份有限公司 | Text keyword extraction method, device, equipment and storage medium |
CN112862458A (en) * | 2021-03-02 | 2021-05-28 | 岭东核电有限公司 | Nuclear power test procedure supervision method and device, computer equipment and storage medium |
CN113536783A (en) * | 2021-07-14 | 2021-10-22 | 福建亿榕信息技术有限公司 | Model-based new word discovery method |
CN115511668A (en) * | 2022-10-12 | 2022-12-23 | 金华智扬信息技术有限公司 | Case supervision method, device, equipment and medium based on artificial intelligence |
CN115511668B (en) * | 2022-10-12 | 2023-09-08 | 金华智扬信息技术有限公司 | Case supervision method, device, equipment and medium based on artificial intelligence |
CN118095251A (en) * | 2024-04-23 | 2024-05-28 | 北京国际大数据交易有限公司 | Offline text data evaluation method and device |
CN118095251B (en) * | 2024-04-23 | 2024-06-18 | 北京国际大数据交易有限公司 | Offline text data evaluation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110298033B (en) | 2022-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298033A (en) | Keyword corpus labeling trains extracting tool | |
CN110633409B (en) | Automobile news event extraction method integrating rules and deep learning | |
CN110502621A (en) | Answering method, question and answer system, computer equipment and storage medium | |
US8676815B2 (en) | Suffix tree similarity measure for document clustering | |
CN104216913B (en) | Question answering method, system and computer-readable medium | |
CN110298032A (en) | Text classification corpus labeling training system | |
CN110287481A (en) | Name entity corpus labeling training system | |
CN100595760C (en) | Method for gaining oral vocabulary entry, device and input method system thereof | |
CN105045875B (en) | Personalized search and device | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
CN108717433A (en) | A kind of construction of knowledge base method and device of programming-oriented field question answering system | |
CN110287482B (en) | Semi-automatic participle corpus labeling training device | |
CN103678576A (en) | Full-text retrieval system based on dynamic semantic analysis | |
CN103838833A (en) | Full-text retrieval system based on semantic analysis of relevant words | |
CN103049435A (en) | Text fine granularity sentiment analysis method and text fine granularity sentiment analysis device | |
CN110175585B (en) | Automatic correcting system and method for simple answer questions | |
CN105393263A (en) | Feature completion in computer-human interactive learning | |
CN112307153B (en) | Automatic construction method and device of industrial knowledge base and storage medium | |
CN102184262A (en) | Web-based text classification mining system and web-based text classification mining method | |
CN109145260A (en) | A kind of text information extraction method | |
CN103324700A (en) | Noumenon concept attribute learning method based on Web information | |
CN112051986B (en) | Code search recommendation device and method based on open source knowledge | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
CN114090861A (en) | Education field search engine construction method based on knowledge graph | |
CN101556596A (en) | Input method system and intelligent word making method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||