CN102646096A - Linked word searching system and method - Google Patents
- Publication number
- CN102646096A (application CN2011100407616A)
- Authority
- CN
- China
- Prior art keywords
- vocabulary
- time
- literary composition
- word
- term set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a linked word searching system and method. The linked word searching method comprises the steps of: adding a time stamp on each file, storing the files with same time stamps in a word matrix; arranging the word matrixes of the time stamps according to the time sequence; extracting all word matrixes in an appointed time interval and adding to obtain a unit time word matrix; calculating a relationship strength among all words in the unit time word matrix to obtain a linked word set; and calculating a time interval between each word and a searched word in the linked word set, removing a word with the time interval exceeding a critical value to obtain a new linked word set, wherein the new linked word set is a keyword for final searching. By using the invention, the linked word of the keyword can be accurately extended.
Description
Technical field
The present invention relates to a linked word search system and method.
Background art
At present, when the degree of correlation between words is computed over a file collection in order to find the words related to a given word, the calculation is based on nothing more than the number of times the two words occur together (the co-occurrence count) in the same sentence or the same article, or else a dictionary (for example WordNet) is consulted.
However, the meaning of a word changes over time. For example, when a user expands the word "hadoop" into related words over the whole file collection, the related words found include "hadoop-0.18", "hadoop-0.19", and so on. These words are undeniably strongly related to "hadoop", but the user may be more interested in the related words at the present moment, such as "hadoop-0.20". Alternatively, the user may want to know how "hadoop" was developing a year ago. In that case the related words should be independent of all files from the most recent year, and the expansion should be computed from the files of a year earlier; "hadoop-0.19" is then a more suitable expansion than "hadoop-0.20".
Summary of the invention
In view of the above, it is necessary to provide a linked word search system that adds a time dimension when searching for the linked words of a word, so that the linked words of a keyword are expanded more accurately.
In view of the above, it is also necessary to provide a linked word search method that adds a time dimension when searching for the linked words of a word, so that the linked words of a keyword are expanded more accurately.
A linked word search system, applied in an electronic device, comprises:
A marking module, used to add a time stamp to each file and to store the files that carry the same time stamp in one term-document matrix;
A sorting module, used to arrange the term-document matrices of the time stamps in chronological order;
A first computing module, used to extract all term-document matrices within a specified time period and add them together to obtain a unit-time term-document matrix;
A second computing module, used to calculate the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set; and
A third computing module, used to calculate the time interval between each word in the linked word set and the query word, and to remove every word whose time interval exceeds a critical value, obtaining a new linked word set that serves as the keywords for the final retrieval.
A linked word search method, running in an electronic device, comprises the following steps:
Adding a time stamp to each file, and storing the files that carry the same time stamp in one term-document matrix;
Arranging the term-document matrices of the time stamps in chronological order;
Extracting all term-document matrices within a specified time period and adding them together to obtain a unit-time term-document matrix;
Calculating the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set; and
Calculating the time interval between each word in the linked word set and the query word, and removing every word whose time interval exceeds a critical value, to obtain a new linked word set that serves as the keywords for the final retrieval.
The foregoing method can be carried out by an electronic device (such as a computer) that has a display screen with a graphical user interface (GUI), one or more processors, a memory, and one or more modules, programs, or instruction sets stored in the memory for performing the method. In some embodiments, the electronic device provides multiple functions, including wireless communication.
The instructions for performing the foregoing method may be included in a computer program product configured to be executed by one or more processors.
Compared with the prior art, the linked word search system and method add a time dimension when searching for the linked words of a word, expand the linked words of a keyword more accurately, and improve the efficiency with which users use a retrieval system (such as a natural language processing search engine).
Description of drawings
Fig. 1 is a structural diagram of the electronic device of the present invention.
Fig. 2 is a functional block diagram of the linked word search system.
Fig. 3 is a flowchart of a preferred embodiment of the linked word search method of the present invention.
Fig. 4 is a schematic diagram of term-document matrices arranged in chronological order.
Description of main reference numerals
Element | Reference numeral |
---|---|
Electronic device | 2 |
Display device | 20 |
Input device | 22 |
Memory | 23 |
Linked word search system | 24 |
Processor | 25 |
Marking module | 201 |
Sorting module | 202 |
First computing module | 203 |
Second computing module | 204 |
Third computing module | 205 |
Detailed description
Fig. 1 is a structural diagram of the electronic device of the present invention. In this embodiment, the electronic device 2 (such as a server) comprises a display device 20, an input device 22, a memory 23, a linked word search system 24, and a processor 25, which are connected through a data bus. It will be appreciated that in other embodiments the linked word search system 24 may also be provided in another computing device, such as a PDA (Personal Digital Assistant).
The linked word search system 24 adds a time dimension when searching for the linked words of a word, and thereby expands the linked words of a keyword more accurately. The detailed process is described below.
The memory 23 stores data such as the program code of the linked word search system 24. The display device 20 and the input device 22 serve as the input and output devices of the electronic device 2.
In this embodiment, the linked word search system 24 can be divided into one or more modules, which are stored in the memory 23 and configured to be executed by one or more processors (in this embodiment, the processor 25) to carry out the present invention. For example, as shown in Fig. 2, the linked word search system 24 is divided into a marking module 201, a sorting module 202, a first computing module 203, a second computing module 204, and a third computing module 205. A module, as the term is used in the present invention, is a program segment that performs a specific function, and it is better suited than a whole program to describing how the software executes in the electronic device 2.
Fig. 3 is a flowchart of a preferred embodiment of the linked word search method of the present invention.
Step S1: the marking module 201 adds a time stamp (Time Stamp) to each file to serve as the time dimension when searching for linked words. In this embodiment, the time stamp records, for example, the time at which the file was created or the time at which it was last modified. The files may be stored in the memory 23 or on a remote server.
Step S2: the marking module 201 stores the files that carry the same time stamp in one term-document matrix (Term-Document Matrix).
Step S3: the sorting module 202 arranges the term-document matrices of the time stamps in chronological order. Referring to Fig. 4, $M_n$ denotes the term-document matrix at time $n$; by way of example, Fig. 4 shows the term-document matrices of only three time points. The X axis represents time (Time), the Y axis represents files (Document), and the Z axis represents words (Term).
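Steps S1 to S3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the input format (a list of `(file_id, timestamp, text)` tuples), the whitespace tokenizer, the nested-dictionary matrix layout, and the function name `build_matrices` are all hypothetical and are not specified by the patent.

```python
from collections import Counter, defaultdict

def build_matrices(files):
    """Group files by time stamp and return the matrices in time order.

    files: iterable of (file_id, timestamp, text) tuples (assumed format).
    Returns a list of (timestamp, matrix) pairs, where
    matrix[file_id][term] is the number of times term occurs in the file.
    """
    by_stamp = defaultdict(dict)
    for file_id, stamp, text in files:
        # Steps S1/S2: files with the same time stamp share one matrix.
        by_stamp[stamp][file_id] = Counter(text.lower().split())
    # Step S3: arrange the matrices in chronological order.
    return sorted(by_stamp.items(), key=lambda pair: pair[0])

files = [
    ("d3", 2, "hadoop-0.20 release notes"),
    ("d1", 1, "hadoop hadoop-0.19 cluster"),
    ("d2", 1, "hadoop cluster setup"),
]
matrices = build_matrices(files)
```

A sparse layout is used here because most terms do not occur in most files; a dense array would serve equally well for small collections.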
Step S4: the first computing module 203 extracts all term-document matrices within the specified time period and adds them together to obtain a unit-time term-document matrix. In this embodiment, the specified time period may be a default period (within one year of the current time) or a period entered manually by the user.
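The addition in step S4 might look like the sketch below. It assumes that every matrix is stored sparsely as a `Counter` keyed by `(term, file_id)`, so that matrices for different time stamps can be added element by element, and that the time period is inclusive at both ends; neither detail is stated in the patent, and the name `unit_time_matrix` is hypothetical.

```python
from collections import Counter

def unit_time_matrix(matrices, start, end):
    """Add up every matrix whose time stamp t satisfies start <= t <= end.

    matrices: list of (timestamp, Counter) pairs; each Counter maps
    (term, file_id) -> count (an assumed sparse layout).
    """
    total = Counter()
    for stamp, matrix in matrices:
        if start <= stamp <= end:
            total.update(matrix)  # element-wise addition of counts
    return total

matrices = [
    (1, Counter({("hadoop", "d1"): 2, ("hadoop-0.19", "d1"): 1})),
    (2, Counter({("hadoop", "d2"): 1})),
    (9, Counter({("hadoop-0.20", "d3"): 1})),
]
window = unit_time_matrix(matrices, start=1, end=2)
```

With this toy data the matrix for time 9 falls outside the period, so "hadoop-0.20" does not appear in `window`.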
Step S5: the second computing module 204 calculates the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set. The relationship strength between two words can be calculated in several ways. For example, an SVD (Singular Value Decomposition) matrix operation can be used to find the vector space of the file collection and to compute the vector that represents each word of the collection in that space; the relationship strength between words is then calculated from the angle between the word vectors. If the vector of word $i$ is $V_i$ and the vector of word $j$ is $V_j$, the relationship strength of word $i$ and word $j$ is the cosine of the angle between $V_i$ and $V_j$. The larger the cosine (that is, the smaller the angle), the stronger the correlation between word $i$ and word $j$.
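The cosine measure can be illustrated with a short sketch. To stay self-contained it works directly on the raw rows of a toy term-document matrix and omits the SVD projection described in the text, so it shows only the final comparison step; the counts and the function name `cosine` are invented for illustration. A cosine closer to 1 corresponds to a smaller angle between the two vectors.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Rows are terms, columns are files (toy counts, invented for illustration).
term_vectors = {
    "hadoop":      [2, 1, 0],
    "hadoop-0.19": [1, 1, 0],
    "etching":     [0, 0, 3],
}
sim_close = cosine(term_vectors["hadoop"], term_vectors["hadoop-0.19"])
sim_far = cosine(term_vectors["hadoop"], term_vectors["etching"])
```

Here `sim_close` is larger than `sim_far`, because "hadoop" and "hadoop-0.19" occur in the same files while "hadoop" and "etching" share none.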
It will be appreciated that in other embodiments other methods may be used to calculate the relationship strength between words. For example, a conditional probability model may also be used in the present invention to find the relationship strength between words.
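The patent does not say which conditional probability model is meant. One simple possibility, shown here purely as an assumption, is to estimate the probability that a file contains one word given that it contains another, from file-level co-occurrence. The function name `conditional_probability` and the set-per-file input are hypothetical.

```python
def conditional_probability(files, given, target):
    """Estimate P(target appears in a file | given appears in that file).

    files: list of sets of terms, one set per file (assumed input format).
    """
    with_given = [terms for terms in files if given in terms]
    if not with_given:
        return 0.0  # the conditioning word never occurs
    both = sum(1 for terms in with_given if target in terms)
    return both / len(with_given)

files = [
    {"hadoop", "hadoop-0.19", "cluster"},
    {"hadoop", "cluster"},
    {"etching", "wafer"},
    {"hadoop", "hadoop-0.19"},
]
p = conditional_probability(files, given="hadoop", target="hadoop-0.19")
```

Unlike the cosine, this measure is not symmetric: the probability of "hadoop" given "hadoop-0.19" generally differs from the reverse.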
Step S6: the third computing module 205 calculates the time interval between each word in the linked word set and the query word, removes every word whose time interval exceeds the critical value, and obtains a new linked word set, which serves as the keywords for the final retrieval. In other words, even if the correlation between two words computed from the term-document matrix is very high, the two words are still treated as unrelated in that time period if the periods in which they occur neither overlap significantly nor lie close to each other.
For example, define the variable $\mathrm{Term}_i\mathrm{Time} = \{t_1, t_2, \ldots, t_n\}$ to mean that word $\mathrm{Term}_i$ occurs at times $t_1, t_2, \ldots, t_n$, and define the variable $\mathrm{Gap}_{i,j}$ as the time interval between word $\mathrm{Term}_i$ and word $\mathrm{Term}_j$. Suppose there are two words, A and B. $\mathrm{Term}_A\mathrm{Time} = \{1, 2, 3\}$ means that word A occurs in the file collection at the three time points 1, 2, and 3, and $\mathrm{Term}_B\mathrm{Time} = \{10, 11, 12\}$ means that word B occurs in the file collection at the three time points 10, 11, and 12. Then
$$\mathrm{Gap}_{A,B} = \min(|1-10|, |2-10|, |3-10|, |1-11|, |2-11|, |3-11|, |1-12|, |2-12|, |3-12|) = \min(9, 8, 7, 10, 9, 8, 11, 10, 9) = 7.$$
The time interval between word A and word B is therefore 7.
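The interval calculation above amounts to taking the minimum absolute difference over all pairs of occurrence times. A minimal sketch, with the hypothetical function name `gap`:

```python
def gap(times_a, times_b):
    """Smallest absolute difference between any pair of occurrence times."""
    return min(abs(a - b) for a in times_a for b in times_b)

term_a_time = {1, 2, 3}
term_b_time = {10, 11, 12}
print(gap(term_a_time, term_b_time))  # 7
```

This brute-force form compares every pair of times and raises `ValueError` if either word has no recorded occurrence; for long, sorted occurrence lists a merge-style scan would avoid the quadratic cost.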
Suppose the critical value (threshold) is 5. Because $\mathrm{Gap}_{A,B} = 7 > 5$, even if words A and B turn out to be strongly correlated after the calculation (this is likely when word A often co-occurs with a word C and word C often co-occurs with word B, although words A and B never co-occur), the present invention still treats A and B as unrelated, because their time interval exceeds the critical value. Note that the words referred to in this embodiment are the words that remain after common words have been filtered out. For example, the linked word search system 24 will not conclude that "image coding" is related to "etching technique" merely because "image coding" often co-occurs with "the present invention" and "the present invention" often co-occurs with "etching technique", because "the present invention" is a common word.
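The filtering in step S6 can be sketched as below. The function name `filter_by_gap`, the dictionary layout, and the candidate words E and F are assumptions made for illustration; only word B and the critical value 5 come from the example in the text. A word whose interval equals the critical value is kept, since the text removes only words that exceed it.

```python
def filter_by_gap(query_times, related, threshold):
    """Keep linked words whose time interval to the query word is <= threshold.

    related: dict mapping each candidate word to its occurrence times.
    Returns a dict mapping each kept word to its interval.
    """
    kept = {}
    for word, times in related.items():
        distance = min(abs(q - t) for q in query_times for t in times)
        if distance <= threshold:
            kept[word] = distance
    return kept

query_times = [1, 2, 3]      # occurrence times of the query word A
related = {
    "B": [10, 11, 12],       # interval 7, exceeds the critical value
    "E": [3, 4],             # interval 0, occurs at the same time as A
    "F": [8],                # interval 5, exactly at the critical value
}
new_set = filter_by_gap(query_times, related, threshold=5)
```

With these inputs word B is removed and words E and F remain in the new linked word set.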
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or replaced with equivalents without departing from its spirit and scope.
Claims (10)
1. A linked word search system, applied in an electronic device, characterized in that the system comprises:
A marking module, used to add a time stamp to each file and to store the files that carry the same time stamp in one term-document matrix;
A sorting module, used to arrange the term-document matrices of the time stamps in chronological order;
A first computing module, used to extract all term-document matrices within a specified time period and add them together to obtain a unit-time term-document matrix;
A second computing module, used to calculate the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set; and
A third computing module, used to calculate the time interval between each word in the linked word set and the query word, and to remove every word whose time interval exceeds a critical value, obtaining a new linked word set that serves as the keywords for the final retrieval.
2. The linked word search system as claimed in claim 1, characterized in that the time stamp records the time at which the file was created or the time at which it was last modified.
3. The linked word search system as claimed in claim 1, characterized in that the specified time period is a default period or a manually entered period.
4. The linked word search system as claimed in claim 1, characterized in that the second computing module calculates the relationship strength between words from the angle between the word vectors.
5. The linked word search system as claimed in claim 4, characterized in that the relationship strength is the cosine of the angle between the word vectors.
6. A linked word search method, running in an electronic device, characterized in that the method comprises the following steps:
Adding a time stamp to each file, and storing the files that carry the same time stamp in one term-document matrix;
Arranging the term-document matrices of the time stamps in chronological order;
Extracting all term-document matrices within a specified time period and adding them together to obtain a unit-time term-document matrix;
Calculating the relationship strength between all words in the unit-time term-document matrix to obtain a linked word set; and
Calculating the time interval between each word in the linked word set and the query word, and removing every word whose time interval exceeds a critical value, to obtain a new linked word set that serves as the keywords for the final retrieval.
7. The linked word search method as claimed in claim 6, characterized in that the time stamp records the time at which the file was created or the time at which it was last modified.
8. The linked word search method as claimed in claim 6, characterized in that the specified time period is a default period or a manually entered period.
9. The linked word search method as claimed in claim 6, characterized in that the relationship strength between the words is calculated from the angle between the word vectors.
10. The linked word search method as claimed in claim 9, characterized in that the relationship strength is the cosine of the angle between the word vectors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100407616A CN102646096A (en) | 2011-02-18 | 2011-02-18 | Linked word searching system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102646096A true CN102646096A (en) | 2012-08-22 |
Family
ID=46658919
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011100407616A Pending CN102646096A (en) | 2011-02-18 | 2011-02-18 | Linked word searching system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102646096A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040250277A1 (en) * | 1998-11-23 | 2004-12-09 | Opentv, Inc. | Dynamic event information table schedule window |
CN101281523A (en) * | 2007-04-25 | 2008-10-08 | 北大方正集团有限公司 | Method and device for enquire enquiry extending as well as related searching word stock |
CN101714145A (en) * | 2008-10-07 | 2010-05-26 | 英业达股份有限公司 | Website news analyzing system and method thereof |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015043070A1 (en) * | 2013-09-29 | 2015-04-02 | 北大方正集团有限公司 | Method and system for obtaining a knowledge point implicit relationship |
US10210281B2 (en) | 2013-09-29 | 2019-02-19 | Peking University Founder Group Co., Ltd. | Method and system for obtaining knowledge point implicit relationship |
CN108416019A (en) * | 2018-03-06 | 2018-08-17 | 王海泉 | Conjunctive word method of adjustment and adjustment system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120822 |