CN102646096A - Linked word searching system and method - Google Patents

Publication number
CN102646096A
Authority
CN
China
Prior art keywords
vocabulary
time
literary composition
word
term set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100407616A
Other languages
Chinese (zh)
Inventor
李忠一
叶建发
蔡程丰
卢俊锜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hongfujin Precision Industry Shenzhen Co Ltd
Hon Hai Precision Industry Co Ltd
Original Assignee
Hongfujin Precision Industry Shenzhen Co Ltd
Hon Hai Precision Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hongfujin Precision Industry Shenzhen Co Ltd, Hon Hai Precision Industry Co Ltd filed Critical Hongfujin Precision Industry Shenzhen Co Ltd
Priority to CN2011100407616A priority Critical patent/CN102646096A/en
Publication of CN102646096A publication Critical patent/CN102646096A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a linked word searching system and method. The linked word searching method comprises the steps of: adding a time stamp on each file, storing the files with same time stamps in a word matrix; arranging the word matrixes of the time stamps according to the time sequence; extracting all word matrixes in an appointed time interval and adding to obtain a unit time word matrix; calculating a relationship strength among all words in the unit time word matrix to obtain a linked word set; and calculating a time interval between each word and a searched word in the linked word set, removing a word with the time interval exceeding a critical value to obtain a new linked word set, wherein the new linked word set is a keyword for final searching. By using the invention, the linked word of the keyword can be accurately extended.

Description

Linked word search system and method
Technical field
The present invention relates to a linked word search system and method.
Background
At present, the degree of correlation between words in a file set, and hence the linked words of a given word, is typically calculated from the number of times two words occur together in the same sentence or the same article (the co-occurrence count), or is looked up in a dictionary such as WordNet.
However, the meaning of a word shifts over time. For example, when "hadoop" is expanded into linked words over the whole file set, the linked words found include "hadoop-0.18", "hadoop-0.19" and so on. These words are undeniably strongly related to "hadoop", but the user may be more interested in the linked words at the present time, such as "hadoop-0.20". Alternatively, the user may want to know how "hadoop" was developing a year ago. In that case the linked words should be computed from the files of a year ago rather than from the files of the most recent year, so "hadoop-0.19" is a more suitable expansion than "hadoop-0.20".
Summary of the invention
In view of the above, it is necessary to provide a linked word search system that adds a time dimension when searching for the linked words of a word, so that the linked words of a keyword are expanded more accurately.
It is likewise necessary to provide a linked word search method that adds a time dimension when searching for the linked words of a word, so that the linked words of a keyword are expanded more accurately.
A linked word search system is applied in an electronic device and comprises:
a marking module, configured to add a time stamp to each file and to store the files bearing the same time stamp in one term-document matrix;
a sorting module, configured to arrange the term-document matrices of the time stamps in chronological order;
a first computing module, configured to take out all term-document matrices within a specified time period and add them to obtain a unit-time term-document matrix;
a second computing module, configured to calculate the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set; and
a third computing module, configured to calculate the time interval between each word in the linked word set and the query word, and to remove every word whose time interval exceeds a threshold, obtaining a new linked word set that serves as the keywords for the final retrieval.
A linked word search method runs in an electronic device and comprises the following steps:
adding a time stamp to each file, and storing the files bearing the same time stamp in one term-document matrix;
arranging the term-document matrices of the time stamps in chronological order;
taking out all term-document matrices within a specified time period and adding them to obtain a unit-time term-document matrix;
calculating the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set; and
calculating the time interval between each word in the linked word set and the query word, and removing every word whose time interval exceeds a threshold, to obtain a new linked word set that serves as the keywords for the final retrieval.
The foregoing method can be performed by an electronic device (such as a computer) that has a display screen with a graphical user interface (GUI), one or more processors, a memory, and one or more modules, programs or instruction sets stored in the memory for performing the method. In some embodiments, the electronic device provides multiple functions, including wireless communication.
Instructions for performing the foregoing method may be included in a computer program configured to be executed by one or more processors.
Compared with the prior art, the linked word search system and method add a time dimension when searching for the linked words of a word, expand the linked words of a keyword more accurately, and improve the efficiency with which users use a retrieval system (such as a natural language processing search engine).
Description of drawings
Fig. 1 is a schematic structural diagram of the electronic device of the present invention.
Fig. 2 is a functional block diagram of the linked word search system.
Fig. 3 is a flowchart of a preferred embodiment of the linked word search method of the present invention.
Fig. 4 is a schematic diagram of term-document matrices arranged in chronological order.
Description of main reference numerals
Electronic device 2
Display device 20
Input device 22
Memory 23
Linked word search system 24
Processor 25
Marking module 201
Sorting module 202
First computing module 203
Second computing module 204
Third computing module 205
Embodiment
Fig. 1 is a schematic structural diagram of the electronic device of the present invention. In this embodiment, the electronic device 2 (for example, a server) comprises a display device 20, an input device 22, a memory 23, a linked word search system 24 and a processor 25 connected by a data bus. It will be appreciated that in other embodiments the linked word search system 24 may also be provided in another computing device, such as a PDA (Personal Digital Assistant).
The linked word search system 24 adds a time dimension when searching for the linked words of a word, and thereby expands the linked words of a keyword more accurately. The detailed process is described below.
The memory 23 stores data such as the program code of the linked word search system 24. The display device 20 and the input device 22 serve as the input and output devices of the electronic device 2.
In this embodiment, the linked word search system 24 may be divided into one or more modules, which are stored in the memory 23 and configured to be executed by one or more processors (the processor 25 in this embodiment) to carry out the present invention. For example, as shown in Fig. 2, the linked word search system 24 is divided into a marking module 201, a sorting module 202, a first computing module 203, a second computing module 204 and a third computing module 205. A module, as the term is used here, is a program segment that performs a specific function, and is better suited than a whole program to describing how the software executes in the electronic device 2.
Fig. 3 is a flowchart of a preferred embodiment of the linked word search method of the present invention.
Step S1: the marking module 201 adds a time stamp to each file, which serves as the time dimension when linked words are searched. In this embodiment, the time stamp records the time the file was created, the time it was last modified, or the like. The files may be stored in the memory 23 or on a remote server.
Step S2: the marking module 201 stores the files bearing the same time stamp in one term-document matrix.
Step S3: the sorting module 202 arranges the term-document matrices of the time stamps in chronological order. As shown in Fig. 4, M_n denotes the term-document matrix at time n; by way of example, Fig. 4 shows the term-document matrices of only three time points. The X axis represents time, the Y axis represents documents, and the Z axis represents terms.
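Steps S1 to S3 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `build_matrices`, the integer time stamps, whitespace tokenization and the dictionary layout of a matrix (term mapped to a list of per-document counts) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_matrices(files):
    """Group (time stamp, text) pairs into per-stamp term-document matrices.

    Each matrix is a dict mapping a term to a list of counts, one count per
    document bearing that time stamp. The result is sorted by time stamp.
    """
    by_stamp = defaultdict(list)
    for stamp, text in files:                  # S1: every file carries a time stamp
        by_stamp[stamp].append(Counter(text.lower().split()))

    matrices = []
    for stamp in sorted(by_stamp):             # S3: chronological order
        docs = by_stamp[stamp]
        terms = sorted(set().union(*docs))
        # S2: one matrix per time stamp; a Counter returns 0 for absent terms
        matrix = {term: [doc[term] for doc in docs] for term in terms}
        matrices.append((stamp, matrix))
    return matrices

# Hypothetical sample files: (time stamp, text)
files = [
    (2, "hadoop-0.19 release notes for hadoop"),
    (1, "hadoop-0.18 cluster setup"),
    (2, "hadoop-0.19 cluster tuning"),
]
ordered = build_matrices(files)
```

With the sample data, `ordered` holds the matrix for time stamp 1 followed by the matrix for time stamp 2, and the second matrix has two document columns.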
Step S4: the first computing module 203 takes out all term-document matrices within the specified time period and adds them to obtain a unit-time term-document matrix. In this embodiment, the specified time period may be a default period (within one year of the current time) or a period entered manually by the user.
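Step S4 might look like the Python sketch below. The text does not say how matrices covering different documents are added, so the sketch assumes one reading: each matrix is zero-padded to a shared document axis and the padded matrices are added element by element, which amounts to placing the document columns of the selected matrices side by side. The name `unit_time_matrix` and the sample data are hypothetical.

```python
def unit_time_matrix(matrices, start, end):
    """Merge the matrices whose time stamp lies in [start, end].

    Assumption: "adding" means element-wise addition after zero-padding each
    matrix to the full document axis, i.e. concatenating document columns.
    """
    selected = [m for stamp, m in matrices if start <= stamp <= end]
    # Number of document columns in each selected matrix
    widths = [len(next(iter(m.values()))) if m else 0 for m in selected]
    terms = sorted(set().union(*selected)) if selected else []
    merged = {}
    for term in terms:
        row = []
        for m, width in zip(selected, widths):
            row.extend(m.get(term, [0] * width))   # zero-pad absent terms
        merged[term] = row
    return merged

# Hypothetical per-stamp matrices: term -> counts per document
matrices = [
    (1, {"hadoop": [1], "hadoop-0.18": [1]}),
    (2, {"hadoop": [1, 0], "hadoop-0.19": [1, 1]}),
    (3, {"hadoop": [2], "hadoop-0.20": [1]}),
]
recent = unit_time_matrix(matrices, start=2, end=3)
```

Choosing the window 2 to 3 leaves "hadoop-0.18" out of `recent` entirely, which is the effect the time dimension is meant to have.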
Step S5: the second computing module 204 calculates the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set. There are many ways to calculate the relationship strength between two words. For example, an SVD (Singular Value Decomposition) matrix operation can be used to find the vector space of the file set, the vector representing each word of the file set in that vector space is computed, and the relationship strength between words is calculated from the angle between their vectors. If the vector of word i is V_i and the vector of word j is V_j, the relationship strength of word i and word j is the cosine of the angle between V_i and V_j; the larger the cosine (that is, the smaller the angle), the more strongly word i and word j are correlated.
It will be appreciated that other methods of calculating the relationship strength between words may be adopted in other embodiments. For example, a conditional probability model can also be used in the present invention to find the relationship strength between words.
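The cosine measure of step S5 can be illustrated with a short Python sketch. It deliberately skips the SVD projection and takes each word's raw row of the unit-time matrix as its vector, so it shows only the angle computation. The cut-off `min_strength` is a hypothetical parameter; the text does not state how the linked word set is selected from the strengths.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def linked_words(matrix, query, min_strength):
    """Return {word: strength} for words whose cosine with the query is high enough."""
    q = matrix[query]
    strengths = {w: cosine(q, row) for w, row in matrix.items() if w != query}
    return {w: s for w, s in strengths.items() if s >= min_strength}

# Hypothetical unit-time matrix: word -> counts per document
matrix = {
    "hadoop": [1, 0, 2],
    "hadoop-0.19": [1, 1, 0],
    "hadoop-0.20": [0, 0, 1],
    "etching": [0, 1, 0],
}
result = linked_words(matrix, "hadoop", min_strength=0.3)
```

Here "hadoop-0.20" scores higher than "hadoop-0.19" because its row points in nearly the same direction as the row of "hadoop", while "etching" scores zero and is dropped.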
Step S6: the third computing module 205 calculates the time interval between each word in the linked word set and the query word, and removes every word whose time interval exceeds the threshold, obtaining a new linked word set that serves as the keywords for the final retrieval. In other words, even if the correlation of two words computed from the term-document matrix is very high, if the time periods in which the two words occur neither overlap substantially nor adjoin each other, the two words are still treated as unrelated in that time period.
For example, define the variable Term_i.Time = {t1, t2, ..., tn} to mean that word Term_i occurs at times t1, t2, ..., tn, and define the variable Gap_{i,j} as the time interval between word Term_i and word Term_j. Suppose there are two words, A and B. Term_A.Time = {1, 2, 3} means that word A occurs in the file set at the three time points 1, 2 and 3, and Term_B.Time = {10, 11, 12} means that word B occurs in the file set at the three time points 10, 11 and 12. Then Gap_{A,B} = min(|1-10|, |2-10|, |3-10|, |1-11|, |2-11|, |3-11|, |1-12|, |2-12|, |3-12|) = min(9, 8, 7, 10, 9, 8, 11, 10, 9) = 7. The time interval between word A and word B is therefore 7.
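The gap computation in this example reduces to the minimum absolute difference over all pairs of occurrence times. A minimal Python sketch, with the function name `gap` chosen only for illustration:

```python
from itertools import product

def gap(times_i, times_j):
    """Smallest absolute difference between any occurrence time of word i and of word j."""
    return min(abs(a - b) for a, b in product(times_i, times_j))

# Occurrence times of words A and B from the worked example
gap_ab = gap({1, 2, 3}, {10, 11, 12})   # 7, reached by the pair (3, 10)
```

Two words that share at least one occurrence time have a gap of 0.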
Suppose the threshold is 5. Because Gap_{A,B} > 5, even if word A and word B are computed to be highly correlated (which is likely to happen when word A often co-occurs with a word C and word C often co-occurs with word B, although word A and word B never co-occur), the present invention still treats word A and word B as unrelated, because their time interval exceeds the threshold. Note that the words referred to in this embodiment are those that remain after common words have been filtered out. For example, the linked word search system 24 does not conclude that "image coding" is related to "etching technique" merely because "image coding" often co-occurs with "the present invention" and "the present invention" often co-occurs with "etching technique", since "the present invention" is a common word.
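The threshold filter of step S6 can be sketched in Python as below. The function name `filter_by_gap`, the word C with its occurrence times, and the choice to drop common words inside the same function are assumptions for illustration; the text only states that common words are filtered out, without saying at which stage.

```python
def filter_by_gap(query, candidates, occurrences, threshold, common_words=frozenset()):
    """Keep candidates whose time gap to the query word does not exceed the threshold."""
    query_times = occurrences[query]
    kept = []
    for word in candidates:
        if word in common_words:               # common words are never linked words
            continue
        distance = min(abs(a - b) for a in query_times for b in occurrences[word])
        if distance <= threshold:
            kept.append(word)
    return kept

# Occurrence times; A and B follow the worked example, the rest are hypothetical
occurrences = {
    "A": {1, 2, 3},
    "B": {10, 11, 12},
    "C": {3, 4, 5},
    "the present invention": {1, 2, 3, 10, 11, 12},
}
kept = filter_by_gap(
    "A",
    ["B", "C", "the present invention"],
    occurrences,
    threshold=5,
    common_words={"the present invention"},
)
```

With a threshold of 5, word B is removed because its gap to A is 7, word C is kept because it shares time point 3 with A, and the common word is skipped regardless of its gap.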
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or replaced by equivalents without departing from its spirit and scope.

Claims (10)

1. A linked word search system, applied in an electronic device, characterized in that the system comprises:
a marking module, configured to add a time stamp to each file and to store the files bearing the same time stamp in one term-document matrix;
a sorting module, configured to arrange the term-document matrices of the time stamps in chronological order;
a first computing module, configured to take out all term-document matrices within a specified time period and add them to obtain a unit-time term-document matrix;
a second computing module, configured to calculate the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set; and
a third computing module, configured to calculate the time interval between each word in the linked word set and the query word, and to remove every word whose time interval exceeds a threshold, obtaining a new linked word set that serves as the keywords for the final retrieval.
2. The linked word search system of claim 1, characterized in that the time stamp records the time the file was created or the time it was last modified.
3. The linked word search system of claim 1, characterized in that the specified time period is a default time period or a manually entered time period.
4. The linked word search system of claim 1, characterized in that the second computing module calculates the relationship strength between words from the angle between their word vectors.
5. The linked word search system of claim 4, characterized in that the relationship strength is the cosine of the angle between the word vectors.
6. A linked word search method, running in an electronic device, characterized in that the method comprises the following steps:
adding a time stamp to each file, and storing the files bearing the same time stamp in one term-document matrix;
arranging the term-document matrices of the time stamps in chronological order;
taking out all term-document matrices within a specified time period and adding them to obtain a unit-time term-document matrix;
calculating the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set; and
calculating the time interval between each word in the linked word set and the query word, and removing every word whose time interval exceeds a threshold, to obtain a new linked word set that serves as the keywords for the final retrieval.
7. The linked word search method of claim 6, characterized in that the time stamp records the time the file was created or the time it was last modified.
8. The linked word search method of claim 6, characterized in that the specified time period is a default time period or a manually entered time period.
9. The linked word search method of claim 6, characterized in that the relationship strength between words is calculated from the angle between their word vectors.
10. The linked word search method of claim 9, characterized in that the relationship strength is the cosine of the angle between the word vectors.
CN2011100407616A 2011-02-18 2011-02-18 Linked word searching system and method Pending CN102646096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100407616A CN102646096A (en) 2011-02-18 2011-02-18 Linked word searching system and method

Publications (1)

Publication Number Publication Date
CN102646096A true CN102646096A (en) 2012-08-22

Family

ID=46658919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100407616A Pending CN102646096A (en) 2011-02-18 2011-02-18 Linked word searching system and method

Country Status (1)

Country Link
CN (1) CN102646096A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040250277A1 (en) * 1998-11-23 2004-12-09 Opentv, Inc. Dynamic event information table schedule window
CN101281523A (en) * 2007-04-25 2008-10-08 北大方正集团有限公司 Method and device for enquire enquiry extending as well as related searching word stock
CN101714145A (en) * 2008-10-07 2010-05-26 英业达股份有限公司 Website news analyzing system and method thereof

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015043070A1 (en) * 2013-09-29 2015-04-02 北大方正集团有限公司 Method and system for obtaining a knowledge point implicit relationship
US10210281B2 (en) 2013-09-29 2019-02-19 Peking University Founder Group Co., Ltd. Method and system for obtaining knowledge point implicit relationship
CN108416019A (en) * 2018-03-06 2018-08-17 王海泉 Conjunctive word method of adjustment and adjustment system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120822