CN109033478A - Text information regularity analysis method and system for a search engine - Google Patents

Text information regularity analysis method and system for a search engine

Info

Publication number
CN109033478A
Authority
CN
China
Prior art keywords
text
phrase
original document
sample
regularity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811062638.2A
Other languages
Chinese (zh)
Other versions
CN109033478B (en)
Inventor
郑燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Industry Polytechnic College
Original Assignee
Chongqing Industry Polytechnic College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Industry Polytechnic College
Priority to CN201811062638.2A
Publication of CN109033478A
Application granted
Publication of CN109033478B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a text information regularity analysis method for a search engine, comprising: obtaining the text of a natural-language original document; performing feature extraction on the text of the natural-language original document to generate a text feature vector; matching the text against the samples in a sample library according to the text feature vector, using a pre-trained vector matching model, to obtain a target sample; determining the semantic distribution pattern of the text, using a pre-trained semantic distribution pattern determination model, according to the text feature consistency between the sample original document of the target sample and the corresponding target sample index set; and converting the text of the natural-language original document into an index set according to the semantic distribution pattern of the text. The present application also provides a text information regularity analysis system. The application mines, from a natural-language original document, the distribution regularity of the words that carry its core semantics, thereby achieving highly accurate extraction of search engine index terms.

Description

Text information regularity analysis method and system for a search engine
Technical field
The present application relates to the technical field of internet applications, and in particular to a text information regularity analysis method and system for a search engine.
Background
A search engine is an indispensable tool for obtaining the knowledge and information people need from the massive data of the internet. Search engines originally arose from the need to search text information, and text information search remains one of their main functions today.
In the search process for text information, the indexer of a search engine extracts index terms from original documents on the internet. An index term is usually one of several words that occur in the original document, and each index term is stored in an index table together with a link to its corresponding original document. The searcher then matches the user's query keywords against the index terms in the index library so as to retrieve original documents quickly. The searcher also evaluates the relevance between each original document and the query keywords, ranks the results to be output accordingly, and presents the search results to the user, including links pointing to the original documents. In this search process, extracting index terms from an original document is a relatively complicated step. In an original document written in natural language, the words that carry its core semantics are submerged among a large number of other words. The words carrying the core semantics are not necessarily the words that dominate in word frequency (i.e. the number or proportion of occurrences of a word in the document); the grammar of natural language lacks clearly defined rules or markers that would help identify the core semantic words; and the words carrying the core semantics are not always located at fixed positions in the original document. In other words, the regularity linking the core semantics of natural language to the text information of its original document is hidden and variable. Prior-art search engines mainly rely on word frequency statistics, combined with weight distribution rules based on article structure, to extract index terms from natural-language documents. Erroneous extraction results therefore often occur: the extracted index terms do not reflect the core semantics of the original document, while the words that do carry the core semantics are missed. Such errors are especially likely in short documents that have no additional semantic tags.
In natural reading, humans rely on the comprehension ability accumulated through life experience and language learning to find, in a passage, the words that serve as carriers of its core semantics. Reproducing human reading comprehension with a computer, however, still faces considerable obstacles.
Artificial intelligence (AI) is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in the field includes robotics, language recognition, image recognition, natural language processing and expert systems. Since its birth, the theory and technology of artificial intelligence have matured steadily, and its fields of application have kept expanding. In text research, artificial intelligence technology has been applied to many aspects of natural language, such as semantic recognition and machine translation. Given its potential for simulating human mental activity, developers of search engines generally hope to use this technology to analyze text information regularity, so as to help extract index terms that carry the core semantics from natural-language original documents, especially short documents without auxiliary information such as tags.
Summary of the invention
In view of this, the purpose of the present application is to propose a text information regularity analysis method and system for a search engine, so as to solve the technical problem in the prior art that search engine index term extraction is difficult and error-prone because the regularity of the words carrying a document's core semantics is inconspicuous and uncertain.
Based on the above purpose, one aspect of the present application proposes a text information regularity analysis method for a search engine, comprising:
Obtaining the text of a natural-language original document;
Performing feature extraction on the text of the natural-language original document to generate a text feature vector;
Matching the text against the samples in a sample library according to the text feature vector, using a pre-trained vector matching model, to obtain a target sample, wherein each sample comprises a sample index set and a sample original document corresponding to the sample index set;
Determining the semantic distribution pattern of the text, using a pre-trained semantic distribution pattern determination model, according to the text feature consistency between the sample original document of the target sample and the corresponding target sample index set;
Converting the text of the natural-language original document into an index set according to the semantic distribution pattern of the text.
In some embodiments, performing feature extraction on the text of the natural-language original document to generate a text feature vector comprises:
Extracting the phrases in the text, classifying the phrases by attribute, counting the word frequency of the phrases of each category, and generating the text feature vector according to the phrase categories and the word frequency of the phrases of each category.
In some embodiments, extracting the phrases in the text, classifying the phrases by attribute and counting the word frequency of the phrases of each category comprises:
Segmenting the text into multiple phrases, sorting each phrase to determine its attribute category, and counting the word frequency of the phrases of each attribute category.
In some embodiments, sorting each phrase to determine its attribute category specifically comprises:
Constructing a phrase attribute classification table, which contains the phrase attribute categories and the phrase semantics corresponding to each category, and performing semantic recognition on each phrase to determine its phrase attribute category.
In some embodiments, after segmenting the text into multiple phrases and performing semantic recognition on each phrase, the method further comprises:
Performing stop-word filtering and denoising on the phrases after semantic recognition, filtering out the noise phrases contained among them.
In some embodiments, matching the text against the samples in the sample library according to the text feature vector, using the pre-trained vector matching model, comprises:
Training a neural network model in advance to generate the vector matching model, and using the vector matching model to calculate the standard deviation between the text feature vector of the current natural-language original document text and the text feature vector of each sample original document in the sample library; when the standard deviation is less than a preset threshold, the match succeeds, and the successfully matched sample original document is taken as the target sample original document.
In some embodiments, determining the semantic distribution pattern of the text, using the pre-trained semantic distribution pattern determination model, according to the text feature consistency between the sample original document of the target sample and the corresponding target sample index set comprises:
Calculating the text feature vectors of the target sample original document and of the corresponding target sample index set, and determining the semantic distribution pattern of the text according to the consistency of the phrase frequencies of phrases of the same category in the two text feature vectors.
Based on the above purpose, another aspect of the present application proposes a text information regularity analysis system for a search engine, comprising:
A text acquisition module, configured to obtain the text of a natural-language original document;
A text feature vector generation module, configured to perform feature extraction on the text of the natural-language original document and generate a text feature vector;
A vector matching module, configured to match the text of the natural-language original document against the samples in a sample library according to the text feature vector and obtain a target sample;
A semantic distribution pattern determination module, configured to determine the semantic distribution pattern of the text according to the text feature consistency between the target sample original document and the corresponding target sample index set;
An index set generation module, configured to convert the text of the natural-language original document into an index set according to the semantic distribution pattern of the text.
In some embodiments, the text feature vector generation module is specifically configured to:
Extract the phrases in the text, classify the phrases by attribute, count the word frequency of the phrases of each attribute category, and generate the text feature vector according to the phrase attribute categories and the word frequency of the phrases of each category.
In some embodiments, the semantic distribution pattern determination module is specifically configured to:
Calculate the text feature vectors of the target sample original document and of the corresponding target sample index set, and determine the semantic distribution pattern according to the consistency of the phrase frequencies of phrases of the same category in the two text feature vectors.
The text information regularity analysis method and system for a search engine provided by the embodiments of the present application perform feature extraction on the text of the natural-language original document to generate a text feature vector; match the text against the samples in a sample library according to the text feature vector, using a pre-trained vector matching model, to obtain a target sample; determine the semantic distribution pattern of the text according to the text feature consistency between the sample original document of the target sample and the corresponding target sample index set; and convert the text of the natural-language original document into an index set according to the semantic distribution pattern of the text. Through artificial intelligence learning, the embodiments of the present application mine, from a natural-language original document, the distribution regularity of the words that carry its core semantics, thereby achieving highly accurate extraction of search engine index terms.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 is a flowchart of the text information regularity analysis method for a search engine according to Embodiment 1 of the present application;
Fig. 2 is a flowchart of the text information regularity analysis method for a search engine according to Embodiment 2 of the present application;
Fig. 3 is a structural schematic diagram of the text information regularity analysis system for a search engine according to Embodiment 3 of the present application;
Fig. 4 is a schematic flowchart, according to Embodiment 4 of the present application, of generating an index set using the text information regularity analysis system for a search engine of the embodiments of the present application.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the related invention.
It should be noted that, provided they do not conflict, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As one embodiment of the present application, Fig. 1 shows the flowchart of the text information regularity analysis method for a search engine according to Embodiment 1. As can be seen from the figure, the method provided in this embodiment comprises the following steps:
S101: Obtain the text of a natural-language original document.
In this embodiment, the text of the natural-language original document may be entered manually or obtained automatically by the system. In this and the following embodiments, a natural-language original document refers to a passage of text, for example "Light color is a value in optics that expresses the color of light in units of K (kelvin); the light colors commonly encountered in daily life range from 2700K to 6500K, while industrial lighting and special fields (such as automotive lighting) use light sources with a light color above 7000K for illumination", or "An expressway shall indicate the driving speed of each lane; the maximum speed must not exceed 120 kilometers per hour and the minimum speed must not be lower than 60 kilometers per hour; for small passenger vehicles traveling on an expressway the maximum speed must not exceed 120 kilometers per hour, for other motor vehicles it must not exceed 100 kilometers per hour, and for motorcycles it must not exceed 80 kilometers per hour". A search engine can search for and collect massive quantities of natural-language original document text from raw data such as web pages, e-books and papers.
S102: Perform feature extraction on the text of the natural-language original document to generate a text feature vector.
In this embodiment, once the text of the natural-language original document has been obtained, feature extraction can be performed on it to generate a text feature vector. Specifically, the text can be divided into multiple phrases, after which stop-word removal discards the phrases that have no practical meaning; a common stop-word list can be used for this purpose. Stop-word removal filters and denoises the phrases produced by segmentation, removing the noise phrases among them. The text may contain conjunctions and adverbs, which carry no practical meaning when semantic recognition is performed on the text. The phrases remaining after semantic recognition can therefore be filtered and denoised to remove conjunctions, adverbs and other phrases without practical meaning, which significantly reduces the workload of the machine.
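The stop-word filtering described above can be sketched as follows. This is only a minimal illustration: it assumes the text has already been segmented into phrases, and the stop-word list `STOP_WORDS` and the function name `remove_stop_words` are hypothetical, not taken from the patent.

```python
# Hypothetical stop-word list (conjunctions, adverbs); a real system would
# load a common stop-word table instead.
STOP_WORDS = {"however", "also", "and", "but", "very"}


def remove_stop_words(phrases, stop_words=STOP_WORDS):
    """Return the phrases that are not stop words, preserving their order."""
    return [p for p in phrases if p.lower() not in stop_words]


# Example: phrases produced by a (not shown) segmentation step.
phrases = ["however", "the maximum speed", "also", "must not exceed",
           "120 kilometers per hour"]
kept = remove_stop_words(phrases)
```

Only the phrases that survive this filter are passed on to categorization and counting.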
Next, the remaining phrases are sorted into predefined categories, and the word frequency, i.e. the number of phrases of each category in the original document, is counted per category. The text feature vector is generated from the phrase categories and the number of phrases in each category. Take again the example "An expressway shall indicate the driving speed of each lane; the maximum speed must not exceed 120 kilometers per hour and the minimum speed must not be lower than 60 kilometers per hour; for small passenger vehicles traveling on an expressway the maximum speed must not exceed 120 kilometers per hour, for other motor vehicles it must not exceed 100 kilometers per hour, and for motorcycles it must not exceed 80 kilometers per hour". In this example the phrase categories may include concept phrases and quantity phrases. Specifically, the concept phrases include "small passenger vehicle", "other motor vehicles" and "motorcycle", and the quantity phrases include "120 kilometers per hour", "100 kilometers per hour", "80 kilometers per hour" and "60 kilometers per hour".
For the phrase categories above, a phrase category index table can be established that records the common phrases belonging to each category. By looking up this table, each phrase that was extracted from the natural-language original document text and retained after stop-word removal is assigned to its phrase category.
Then, using the phrase categories and the word frequency (number of phrases) counted for each category, the text feature vector corresponding to the text of this natural-language original document is generated, expressed as {(S1, N1), (S2, N2), ..., (Sn, Nn)}, where S1, S2, ..., Sn are the phrase categories, such as the concept phrases and quantity phrases above, and N1, N2, ..., Nn are the word frequencies of the respective phrase categories, i.e. the number of phrases that fall under each category. For the example text above, the extracted text feature vector is {(concept phrase, 3), (quantity phrase, 4)}, where the numbers 3 and 4 are the word frequencies.
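A minimal sketch of building the vector {(S1, N1), ..., (Sn, Nn)} from categorized phrases is given below. The table contents, the names `PHRASE_CATEGORY_TABLE`, `classify` and `text_feature_vector`, and the choice to count each distinct phrase once (which is what the counts 3 and 4 in the worked example suggest) are assumptions made for illustration.

```python
# Hypothetical phrase category index table: category -> common phrases.
PHRASE_CATEGORY_TABLE = {
    "concept phrase": {"small passenger vehicle", "other motor vehicles",
                       "motorcycle"},
    "quantity phrase": {"120 kilometers per hour", "100 kilometers per hour",
                        "80 kilometers per hour", "60 kilometers per hour"},
}


def classify(phrase, table=PHRASE_CATEGORY_TABLE):
    """Look the phrase up in the table; return its category or None."""
    for category, members in table.items():
        if phrase in members:
            return category
    return None


def text_feature_vector(phrases, table=PHRASE_CATEGORY_TABLE):
    """Return [(category, count), ...], counting distinct phrases per category."""
    seen = {category: set() for category in table}
    for phrase in phrases:
        category = classify(phrase, table)
        if category is not None:
            seen[category].add(phrase)
    return [(category, len(seen[category])) for category in table]


# Phrases retained from the expressway example after stop-word removal.
phrases = ["120 kilometers per hour", "60 kilometers per hour",
           "small passenger vehicle", "120 kilometers per hour",
           "other motor vehicles", "100 kilometers per hour",
           "motorcycle", "80 kilometers per hour"]
vector = text_feature_vector(phrases)
```

With these inputs the result has one (category, count) pair per table entry, in table order.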
S103: Match the text against the samples in a sample library according to the text feature vector, using a pre-trained vector matching model, to obtain a target sample, wherein each sample comprises a sample index set and a sample original document corresponding to the sample index set.
In this embodiment, once the text feature vector of the natural-language original document text has been generated, the vector matching model can be used to match the text feature vector against the samples in the sample library. The sample library contains a large number of sample index sets and the sample original documents corresponding to them. Specifically, the vector matching model is a neural network model generated by learning from a large number of samples in the sample library, such that when its input is the text of a natural-language original document, its output is the sample original document most similar to the input text. Similarity here refers to the similarity between the text feature vectors of the texts, including the similarity of the phrase categories and the similarity of the phrase counts for phrases of the same category.
The vector matching model is a pre-trained neural network model. When the text feature vector of the current natural-language original document is input, the model calculates and outputs the standard deviation between that vector and the text feature vector of each sample original document in the sample library. When the standard deviation is less than a preset threshold, the match succeeds, and the successfully matched sample original document is taken as the target sample original document. Specifically, if the text feature vector of the natural-language original document is {(S1, N1), (S2, N2), ..., (Sn, Nn)} and the text feature vector of the sample original document text is {(S1, N1'), (S2, N2'), ..., (Sn, Nn')}, the standard deviation of the two text feature vectors is denoted ε (the expression for ε is not reproduced in this text). If ε is less than the threshold, the match is considered successful, and the target sample original document corresponds to the current natural-language original document.
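The threshold test can be illustrated with a plain computation. The sketch below is not the patent's trained neural network, and because the expression for ε does not survive in this text, it assumes a root-mean-square difference of the per-category counts as the deviation measure. The names `deviation` and `match_sample`, the sample names and the threshold values are hypothetical.

```python
import math


def deviation(vec_a, vec_b):
    """Root-mean-square difference of per-category counts (assumed measure)."""
    counts_b = dict(vec_b)
    squared = [(n - counts_b.get(category, 0)) ** 2 for category, n in vec_a]
    if not squared:
        return 0.0
    return math.sqrt(sum(squared) / len(squared))


def match_sample(doc_vec, sample_library, threshold):
    """Return (sample name, deviation) of the closest sample under the threshold."""
    best = None
    for name, sample_vec in sample_library.items():
        eps = deviation(doc_vec, sample_vec)
        if eps < threshold and (best is None or eps < best[1]):
            best = (name, eps)
    return best  # None when no sample is close enough


doc_vec = [("concept phrase", 3), ("quantity phrase", 4)]
library = {
    "light_color": [("concept phrase", 4), ("quantity phrase", 3)],
    "other": [("concept phrase", 10), ("quantity phrase", 0)],
}
match = match_sample(doc_vec, library, threshold=2.0)
```

Picking the sample with the smallest deviation is a design choice of this sketch; the source only states that a deviation below the threshold counts as a successful match.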
S104: Determine the semantic distribution pattern of the text, using a pre-trained semantic distribution pattern determination model, according to the text feature consistency between the target sample original document and the corresponding target sample index set.
In this embodiment, once the vector matching model has determined the target sample original document corresponding to the natural-language original document text, the phrase categories involved in the index terms of the index set can be determined from the text feature consistency between the sample original document and the corresponding target sample index set. The phrase categories of the target sample index set then determine the phrase categories involved in the index set of the natural-language original document.
Specifically, the semantic distribution pattern determination model in this embodiment is a neural network model generated by learning from a large number of samples in the sample library. By learning from the many sample index sets in the sample library and their corresponding sample original documents, the model is able to determine the consistency between the text feature vector of an input sample index set and that of the corresponding sample original document, and to determine from that consistency the phrase categories involved in the index terms of the index set. Specifically, the model calculates the text feature vectors of the sample original document and of the corresponding sample index set and, according to the phrase frequencies of phrases of the same category in the text feature vectors of the target sample original document and of the corresponding target sample index set, determines that the phrase categories having a relatively high word frequency in both are the phrase categories involved in the index set.
Take the following example. The sample original document is the text "Light color is a value in optics that expresses the color of light in units of K (kelvin); the light colors commonly encountered in daily life range from 2700K to 6500K, while industrial lighting and special fields (such as automotive lighting) use light sources with a light color above 7000K for illumination". The phrase categories of this sample original document include concept phrases and quantity phrases: the extracted phrases "light color", "optics", "illumination" and "light source" are concept phrases, and "2700K", "6500K" and "7000K" are quantity phrases, so its text feature vector is {(concept phrase, 4), (quantity phrase, 3)}. The index terms contained in the corresponding sample index set are "light color", "light source" and "optics", so the text feature vector of the sample index set can be {(concept phrase, 3), (quantity phrase, 0)}. The consistency between the two text feature vectors is that the word frequency is relatively high in the concept phrase dimension for both; the phrase category involved in the index set is therefore determined to be concept phrases.
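The consistency check in this example can be approximated with a simple rule. The patent assigns this decision to a trained neural network; the sketch below instead assumes that a category counts as consistent when its count reaches a hypothetical minimum `min_count` in both vectors. The function name `index_categories` is also an assumption.

```python
def index_categories(doc_vec, index_vec, min_count=1):
    """Return the categories whose count is at least min_count in both vectors."""
    index_counts = dict(index_vec)
    return [category for category, n in doc_vec
            if n >= min_count and index_counts.get(category, 0) >= min_count]


# Feature vectors from the light color example.
sample_doc_vec = [("concept phrase", 4), ("quantity phrase", 3)]
sample_index_vec = [("concept phrase", 3), ("quantity phrase", 0)]
categories = index_categories(sample_doc_vec, sample_index_vec)
```

Here only the concept phrase category passes, because the quantity phrase count in the index set is zero.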
S105: Convert the text of the natural-language original document into an index set according to the semantic distribution pattern of the text.
Step S103 obtains the similarity between the text feature vector of the current natural-language original document text and those of the sample original documents in the sample library, and thereby determines the sample original document that best matches the current text. The phrase categories involved in the index set are then determined from the consistency between that sample original document and its sample index set. Following the same semantic distribution pattern, phrases of the same categories are selected from the current natural-language original document to form its index set, so that the text of the natural-language original document is converted into an index set.
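The final selection step can be sketched as picking, from the current document's phrases, those whose category was chosen for the index set. The table `PHRASE_CATEGORY` and the function name `to_index_set` are hypothetical, and the phrase-to-category mapping is assumed to be available from the earlier classification step.

```python
# Hypothetical phrase -> category mapping for the expressway example.
PHRASE_CATEGORY = {
    "small passenger vehicle": "concept phrase",
    "other motor vehicles": "concept phrase",
    "motorcycle": "concept phrase",
    "120 kilometers per hour": "quantity phrase",
    "100 kilometers per hour": "quantity phrase",
    "80 kilometers per hour": "quantity phrase",
    "60 kilometers per hour": "quantity phrase",
}


def to_index_set(phrases, index_categories, phrase_category=PHRASE_CATEGORY):
    """Keep, in first-seen order, the distinct phrases of the selected categories."""
    index_set = []
    for phrase in phrases:
        if (phrase_category.get(phrase) in index_categories
                and phrase not in index_set):
            index_set.append(phrase)
    return index_set


phrases = ["120 kilometers per hour", "small passenger vehicle",
           "other motor vehicles", "100 kilometers per hour",
           "motorcycle", "80 kilometers per hour", "motorcycle"]
index_set = to_index_set(phrases, ["concept phrase"])
```

If the matched sample indicates concept phrases, the index set for the expressway text consists of its three vehicle terms.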
The text information regularity analysis method for a search engine of this embodiment performs feature extraction on the text of the natural-language original document, matches it against the samples in the sample library according to the text feature vector to obtain a target sample, determines the semantic distribution pattern of the text, using a pre-trained semantic distribution pattern determination model, according to the text feature consistency between the sample original document of the target sample and the corresponding target sample index set, and then converts the text of the natural-language original document into an index set according to that pattern. Machine learning on the samples thus solves the problem of extracting an index set from natural-language original text, especially short text without an index. The method mines, from a natural-language original document, the distribution regularity of the words that carry its core semantics and achieves highly accurate extraction of search engine index terms.
Fig. 2 is a flowchart of the text information regularity analysis method for a search engine according to Embodiment 2 of the present application. As a specific embodiment of the present application, the text information regularity analysis method for a search engine comprises the following steps:
S201: Obtain the text of a natural-language original document.
In this embodiment, the text of the natural-language original document may be massive quantities of natural-language original document text that a search engine searches for and collects from raw data such as web pages, e-books and papers. For details, refer to Embodiment 1; they are not repeated here.
S202: Segment the text into multiple phrases, perform semantic recognition on each phrase, determine the attribute category of each phrase, and group the phrases of the same attribute category.
After segmentation, the text is divided into multiple phrases. Semantic recognition is performed on each phrase according to its meaning, the attribute category of each phrase is determined, and phrases of the same attribute category are grouped together. Specifically, a phrase attribute classification table can be constructed, which contains the phrase attribute categories and the phrase semantics corresponding to each category; semantic recognition is then performed on each phrase to determine its phrase attribute category.
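Grouping phrases by attribute category might look like the sketch below. The patent does not specify how semantic recognition is implemented, so a simple rule stands in for it here: a phrase made of a number followed by an optional unit is treated as a quantity phrase, and everything else as a concept phrase. The regular expression `QUANTITY` and the function names `attribute_category` and `group_by_category` are assumptions.

```python
import re
from collections import defaultdict

# Stand-in for semantic recognition: a number with an optional unit suffix.
QUANTITY = re.compile(r"\d+(?:\.\d+)?\s*[A-Za-z/]*")


def attribute_category(phrase):
    """Return the attribute category of a phrase under the rule above."""
    if QUANTITY.fullmatch(phrase):
        return "quantity phrase"
    return "concept phrase"


def group_by_category(phrases):
    """Group phrases of the same attribute category, preserving their order."""
    groups = defaultdict(list)
    for phrase in phrases:
        groups[attribute_category(phrase)].append(phrase)
    return dict(groups)


# Phrases from the light color example text.
phrases = ["light color", "2700K", "optics", "6500K", "light source", "7000K"]
groups = group_by_category(phrases)
```

The resulting groups feed directly into the frequency counting of the next step.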
S203: counting the phrase frequency in the phrase attribute classification, according to phrase attribute classification and each attribute classification The word frequency of phrase generates Text eigenvector.
S204: Using a pre-trained vector matching model, match the text against the samples in the sample database according to the text feature vector to obtain a target sample, wherein each sample comprises a sample index set and the sample original document corresponding to that sample index set.
S205: Using a pre-trained semantic distribution pattern determination model, determine the semantic distribution pattern of the text according to the text feature consistency between the target sample original document and the corresponding target sample index set.
S206: According to the semantic distribution pattern of the text, convert the text of the natural language original document into an index set.
This embodiment achieves technical effects similar to those of the above embodiment, which are not repeated here.
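As an illustration of steps S202 and S203 only, the following minimal Python sketch builds a text feature vector from category-level phrase frequencies. Everything concrete in it is an assumption made for the example: the table contents, the category names, and the whitespace-based `segment` function, which merely stands in for a real word segmenter and semantic recognizer. The application itself does not specify these.

```python
from collections import Counter

# Hypothetical phrase attribute classification table: category -> phrases.
# The categories and phrases are illustrative, not taken from the application.
ATTRIBUTE_TABLE = {
    "person": {"engineer", "author"},
    "place": {"chongqing", "beijing"},
    "technology": {"search", "engine", "index"},
}


def segment(text):
    """Split text into lowercase phrases (a stand-in for real segmentation)."""
    return [tok.strip(".,;:!?").lower() for tok in text.split()]


def classify(phrase):
    """Return the attribute category of a phrase, or None if it is not in the table."""
    for category, members in ATTRIBUTE_TABLE.items():
        if phrase in members:
            return category
    return None


def feature_vector(text):
    """Count phrases per attribute category and return the counts in table order."""
    counts = Counter()
    for phrase in segment(text):
        category = classify(phrase)
        if category is not None:
            counts[category] += 1
    return [counts[category] for category in ATTRIBUTE_TABLE]


vector = feature_vector("The author in Chongqing built a search engine index.")
```

In this sketch the vector has one component per attribute category, in the order in which the categories appear in the table, and phrases that match no category are ignored.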
Fig. 3 is a schematic structural diagram of the text information rule analysis system for a search engine according to Embodiment Three of the present application. The system provided in this embodiment comprises:
A text acquisition module 301, configured to obtain the text of a natural language original document;
A text feature vector generation module 302, configured to perform feature extraction on the text and generate a text feature vector;
A vector matching module 303, configured to match the text against the samples in the sample database according to the text feature vector to obtain a target sample, wherein each sample comprises a sample index set and the sample original document corresponding to that sample index set;
A semantic distribution pattern determination module 304, configured to determine the semantic distribution pattern of the text according to the text feature consistency between the target sample original document and the corresponding target sample index set;
An index set generation module 305, configured to convert the text of the natural language original document into an index set according to the semantic distribution pattern of the text.
Further, the text feature vector generation module 302 is specifically configured to:
Extract the phrases in the text, classify the phrases by attribute, count the phrase frequency of each attribute category, and generate the text feature vector from the phrase attribute categories and the phrase frequency of each category.
The semantic distribution pattern determination module 304 is specifically configured to:
Compute the text feature vectors of the target sample original document and of the corresponding target sample index set, and determine the semantic distribution pattern according to the consistency of the phrase frequencies of same-category phrases in the two text feature vectors.
The text information rule analysis system for a search engine of this embodiment achieves technical effects similar to those of the above method embodiment, which are not repeated here.
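The roles of modules 304 and 305 can be sketched as follows. The application describes a pre-trained determination model and does not define how "consistency" is measured, so this example substitutes a simple rule that is purely an assumption: a category belongs to the pattern when it occurs in the sample index set and its relative frequency there differs from its relative frequency in the sample original document by no more than a hypothetical `tolerance`. The function names and the input format are likewise invented for illustration.

```python
def shares(freqs):
    """Convert category counts into relative frequencies."""
    total = sum(freqs.values())
    return {c: (n / total if total else 0.0) for c, n in freqs.items()}


def distribution_pattern(doc_freqs, index_freqs, tolerance=0.25):
    """Return the categories whose relative frequency is consistent between
    the sample original document and its sample index set (assumed rule)."""
    doc_share = shares(doc_freqs)
    idx_share = shares(index_freqs)
    pattern = set()
    for category, share in idx_share.items():
        if share > 0 and abs(doc_share.get(category, 0.0) - share) <= tolerance:
            pattern.add(category)
    return pattern


def to_index_set(classified_phrases, pattern):
    """Keep the phrases whose category is in the pattern, without duplicates.

    classified_phrases is a list of (phrase, category) pairs for the new text.
    """
    index_set = []
    for phrase, category in classified_phrases:
        if category in pattern and phrase not in index_set:
            index_set.append(phrase)
    return index_set


pattern = distribution_pattern(
    {"person": 6, "place": 2, "technology": 2},   # sample original document
    {"person": 1, "place": 2, "technology": 2},   # sample index set
)
index_set = to_index_set(
    [("author", "person"), ("chongqing", "place"),
     ("search", "technology"), ("search", "technology")],
    pattern,
)
```

Here the "person" category is dropped because its share differs too much between the sample document and its index set, so only place and technology phrases from the new text reach the index set.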
Fig. 4 is a schematic flowchart of index set generation using the text information rule analysis system for a search engine according to Embodiment Four of the present application. As Fig. 4 shows, when the system is used to generate an index set for a search engine, natural language original document text can be input. After the system obtains that text, the text feature vector generation module generates the text feature vector of the natural language original document text and sends it to the vector matching module.
In this embodiment, the vector matching module is a pre-trained neural network model. Once the text feature vector of the current natural language original document has been input, the module computes and outputs the standard deviation between that vector and the text feature vector of each sample original document in the sample database. When the standard deviation is less than a preset threshold, the match succeeds, and the successfully matched sample original document is taken as the target sample original document. Specifically, the large number of sample original documents already in the sample database can be used in advance to train the neural network model and thereby produce the vector matching module, so that the module matches the text feature vector of the input natural language original document text against the text feature vectors of the sample original documents in the sample database. Because the text feature vector contains the categories of the phrases in the text and the number of phrases in each category, the vector matching module can match the text of the natural language original document with a sample original document on the basis of the phrases each contains and the corresponding phrase counts.
After the sample original document corresponding to the text of the natural language original document has been obtained, the semantic distribution pattern determination module determines the semantic distribution pattern of the text according to the text feature consistency between that sample original document and its corresponding sample index set. Specifically, the module takes the text feature vectors of the input sample original document and of the corresponding sample index set, determines the consistency of the phrase frequencies of same-category phrases in the two vectors, and from this determines the semantic distribution pattern.
The index set generation module then, according to the semantic distribution pattern of the text, extracts phrases of the same category from the text of the natural language original document and converts them into an index set.
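The threshold-based matching described in this embodiment can be illustrated with a short Python sketch. It is not the trained neural network of the application: a plain function replaces the model, and "standard deviation between two feature vectors" is read, as an assumption, as the root-mean-square of their element-wise differences. The sample names, vectors, and threshold are hypothetical.

```python
import math


def deviation(a, b):
    """Root-mean-square of the element-wise differences of two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def match_sample(query_vec, samples, threshold):
    """Return (sample_name, deviation) for the closest sample whose deviation
    is below the threshold, or None when no sample matches."""
    best = None
    for name, vec in samples.items():
        d = deviation(query_vec, vec)
        if d < threshold and (best is None or d < best[1]):
            best = (name, d)
    return best


# Hypothetical sample database: name -> text feature vector of the sample original document.
samples = {"sample_1": [1, 1, 3], "sample_2": [5, 0, 0]}
target = match_sample([1, 1, 3], samples, threshold=1.0)
```

The description only requires the deviation to fall below the preset threshold; choosing the closest of several qualifying samples is a design choice of this sketch.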
The above description covers only the preferred embodiments of the present application and explains the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features. It also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features of similar function disclosed in (but not limited to) the present application.

Claims (10)

1. A text information rule analysis method for a search engine, characterized by comprising:
Obtaining the text of a natural language original document;
Performing feature extraction on the text of the natural language original document to generate a text feature vector;
Matching, using a pre-trained vector matching model and according to the text feature vector, the text against the samples in a sample database to obtain a target sample, wherein each sample comprises a sample index set and a sample original document corresponding to the sample index set;
Determining, using a pre-trained semantic distribution pattern determination model, the semantic distribution pattern of the text according to the text feature consistency between the sample original document of the target sample and the corresponding target sample index set;
Converting the text of the natural language original document into an index set according to the semantic distribution pattern of the text.
2. The text information rule analysis method for a search engine according to claim 1, characterized in that performing feature extraction on the text of the natural language original document to generate a text feature vector comprises:
Extracting the phrases in the text, classifying the phrases by attribute, counting the phrase frequency of each category, and generating the text feature vector from the phrase categories and the phrase frequency of each category.
3. The text information rule analysis method for a search engine according to claim 2, characterized in that extracting the phrases in the text, classifying the phrases by attribute, and counting the phrase frequency of each category comprises:
Segmenting the text into multiple phrases, classifying each phrase, determining the attribute category of each phrase, and counting the phrase frequency of each attribute category.
4. The text information rule analysis method for a search engine according to claim 3, characterized in that classifying each phrase and determining the attribute category of each phrase specifically comprises:
Constructing a phrase attribute classification table, wherein the phrase attribute classification table comprises the phrase attribute categories and the phrase semantics corresponding to each category, and performing semantic recognition on each phrase to determine the phrase attribute category of the phrase.
5. The text information rule analysis method for a search engine according to claim 4, characterized in that, after the text is segmented into multiple phrases and semantic recognition is performed on each phrase, the method further comprises:
Performing stop-word filtering and denoising on the multiple phrases after semantic recognition, to filter out the noise phrases contained in the multiple phrases.
6. The text information rule analysis method for a search engine according to claim 5, characterized in that matching, using a pre-trained vector matching model and according to the text feature vector, the text against the samples in the sample database comprises:
Training a neural network model in advance to generate the vector matching model, and using the vector matching model to compute the standard deviation between the text feature vector of the current natural language original document text and the text feature vector of each sample original document in the sample database, wherein, when the standard deviation is less than a preset threshold, the match succeeds and the successfully matched sample original document is taken as the target sample original document.
7. The text information rule analysis method for a search engine according to claim 6, characterized in that determining, using a pre-trained semantic distribution pattern determination model, the semantic distribution pattern of the text according to the text feature consistency between the sample original document of the target sample and the corresponding target sample index set comprises:
Computing the text feature vectors of the target sample original document and of the corresponding target sample index set, and determining the semantic distribution pattern of the text according to the consistency of the phrase frequencies of same-category phrases in the two text feature vectors.
8. A text information rule analysis system for a search engine, characterized by comprising:
A text acquisition module, configured to obtain the text of a natural language original document;
A text feature vector generation module, configured to perform feature extraction on the text of the natural language original document and generate a text feature vector;
A vector matching module, configured to match, according to the text feature vector, the text of the natural language original document against the samples in a sample database to obtain a target sample;
A semantic distribution pattern determination module, configured to determine the semantic distribution pattern of the text according to the text feature consistency between the target sample original document and the corresponding target sample index set;
An index set generation module, configured to convert the text of the natural language original document into an index set according to the semantic distribution pattern of the text.
9. The text information rule analysis system for a search engine according to claim 8, characterized in that the text feature vector generation module is configured to:
Extract the phrases in the text, classify the phrases by attribute, count the phrase frequency of each attribute category, and generate the text feature vector from the phrase attribute categories and the phrase frequency of each category.
10. The text information rule analysis system for a search engine according to claim 9, characterized in that the semantic distribution pattern determination module is specifically configured to:
Compute the text feature vectors of the target sample original document and of the corresponding target sample index set, and determine the semantic distribution pattern according to the consistency of the phrase frequencies of same-category phrases in the two text feature vectors.
CN201811062638.2A 2018-09-12 2018-09-12 Text information rule analysis method and system for search engine Active CN109033478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811062638.2A CN109033478B (en) 2018-09-12 2018-09-12 Text information rule analysis method and system for search engine

Publications (2)

Publication Number Publication Date
CN109033478A true CN109033478A (en) 2018-12-18
CN109033478B CN109033478B (en) 2022-08-19

Family

ID=64621773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811062638.2A Active CN109033478B (en) 2018-09-12 2018-09-12 Text information rule analysis method and system for search engine

Country Status (1)

Country Link
CN (1) CN109033478B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705251A (en) * 2019-10-14 2020-01-17 支付宝(杭州)信息技术有限公司 Text analysis method and device executed by computer
CN111160568A (en) * 2019-12-27 2020-05-15 北京百度网讯科技有限公司 Machine reading understanding model training method and device, electronic equipment and storage medium
CN111782808A (en) * 2020-06-29 2020-10-16 北京市商汤科技开发有限公司 Document processing method, device, equipment and computer readable storage medium
CN112115892A (en) * 2020-09-24 2020-12-22 科大讯飞股份有限公司 Key element extraction method, device, equipment and storage medium
CN113935329A (en) * 2021-10-13 2022-01-14 昆明理工大学 Asymmetric text matching method based on adaptive feature recognition and denoising
CN115374239A (en) * 2022-07-13 2022-11-22 北京中海住梦科技有限公司 Legal and legal analysis method and device, computer equipment and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070106499A1 (en) * 2005-08-09 2007-05-10 Kathleen Dahlgren Natural language search system
CN102200975A (en) * 2010-03-25 2011-09-28 北京师范大学 Vertical search engine system and method using semantic analysis
CN103106262A (en) * 2013-01-28 2013-05-15 新浪网技术(中国)有限公司 Method and device of file classification and generation of support vector machine model
CN103186662A (en) * 2012-12-28 2013-07-03 中联竞成(北京)科技有限公司 System and method for extracting dynamic public sentiment keywords
CN103838833A (en) * 2014-02-24 2014-06-04 华中师范大学 Full-text retrieval system based on semantic analysis of relevant words
US20160147878A1 (en) * 2014-11-21 2016-05-26 Inbenta Professional Services, L.C. Semantic search engine
CN107402960A (en) * 2017-06-15 2017-11-28 成都优易数据有限公司 A kind of inverted index optimized algorithm based on the weighting of the semantic tone
CN107491518A (en) * 2017-08-15 2017-12-19 北京百度网讯科技有限公司 Method and apparatus, server, storage medium are recalled in one kind search

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
樊重俊等: "《分布估计算法及其引用》", 31 January 2016, 国防工业出版社 *
许鑫: "《基于文本特征计算的信息分析方法》", 30 November 2015, 上海科学技术文献出版社 *
邵欣等: "《物联网技术及应用》", 30 June 2018, 北京航空航天大学出版社 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705251A (en) * 2019-10-14 2020-01-17 支付宝(杭州)信息技术有限公司 Text analysis method and device executed by computer
CN110705251B (en) * 2019-10-14 2023-06-16 支付宝(杭州)信息技术有限公司 Text analysis method and device executed by computer
CN111160568A (en) * 2019-12-27 2020-05-15 北京百度网讯科技有限公司 Machine reading understanding model training method and device, electronic equipment and storage medium
US11410084B2 (en) 2019-12-27 2022-08-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for training machine reading comprehension model, and storage medium
CN111782808A (en) * 2020-06-29 2020-10-16 北京市商汤科技开发有限公司 Document processing method, device, equipment and computer readable storage medium
CN112115892A (en) * 2020-09-24 2020-12-22 科大讯飞股份有限公司 Key element extraction method, device, equipment and storage medium
CN113935329A (en) * 2021-10-13 2022-01-14 昆明理工大学 Asymmetric text matching method based on adaptive feature recognition and denoising
CN115374239A (en) * 2022-07-13 2022-11-22 北京中海住梦科技有限公司 Legal and legal analysis method and device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN109033478B (en) 2022-08-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant