CN104484381A - Method and system for searching multiple strings - Google Patents

Info

Publication number
CN104484381A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410757944.3A
Other languages
Chinese (zh)
Other versions
CN104484381B (en)
Inventor
张�林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
eBay Inc
Original Assignee
eBay Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2010-02-26
Filing date
2010-02-26
Publication date
2015-04-01
Application filed by eBay Inc
Priority to CN201410757944.3A
Priority claimed from CN201010116709.XA (CN102169485B)
Publication of CN104484381A
Application granted
Publication of CN104484381B
Legal status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a system for searching for multiple strings. The method may comprise: storing pattern strings, each of which starts with a first character; for each distinct first character, storing the first character together with the combination of the distinct string lengths of the pattern strings that start with that character; identifying, in a text, a character that matches one of the first characters of the pattern strings; iteratively extracting substrings from the text, each starting with the identified character and having a length equal to one of the string lengths stored with the matching first character; iteratively comparing each extracted substring with each pattern string that has the same first character and the same string length as the substring; and, based on the comparison, determining that at least one extracted substring matches one of the pattern strings.

Description

Method and system for searching multiple strings
This application is a divisional application of invention patent application No. 201010116709.X (filing date: February 26, 2010; title of invention: Method and system for searching multiple strings).
Technical field
The present invention relates generally to the technical field of information processing and, more specifically, to methods and systems for information search.
Background
With the development of computer and network technologies, more and more people perform digital searches to identify or find specific content in digital documents that meets their needs. For example, a person (such as a parent) or an authority may try to find, in digital documents accessible to children, restricted content that is unsuitable for children (such as strings, expressions or words), and then keep children away from that content. In many cases, however, because of the size and/or the sheer number of digital documents, identifying or finding such restricted content is a time-consuming task. An improved search method is therefore needed to perform searches efficiently and reduce search time.
Summary of the invention
The object of the present application is to provide systems and methods that search for multiple strings efficiently, based on combinations of the first characters and the lengths of those strings, so as to reduce search time.
According to a first aspect of the present application, there is provided a system for searching for multiple strings in a text by using one or more processors to execute instructions residing in a computer. The system comprises: a first storage device for storing pattern strings, each of which starts with a first pattern character; and a second storage device for storing combinations of the first pattern character of the pattern strings that start with that character and the corresponding pattern lengths. The system further comprises: a search engine for iteratively identifying, in the text, a character that matches one of the first pattern characters and setting it as the current character; an extractor for iteratively extracting a substring that starts with the current character and has a substring length equal to one of the pattern lengths; and a comparator for iteratively comparing the substring with each pattern string whose first pattern character and string length are the same as the first character and the length of the substring. The system may further comprise a third storage device for storing information related to the substring when the substring matches one of the pattern strings.
According to another aspect of the present application, there is provided a method for searching for multiple strings in a text by using one or more processors to execute instructions residing in a computer. The method comprises: storing, in a first storage device, pattern strings, each of which starts with a first pattern character; and storing, in a second storage device, combinations of the first pattern character of the pattern strings that start with that character and the corresponding pattern lengths. The method further comprises: using a search engine to iteratively identify, in the text, a character that matches one of the first pattern characters, and setting it as the current character; using the search engine to iteratively extract a substring that starts with the current character and has a substring length equal to one of the pattern lengths; using the search engine to iteratively compare the substring with each pattern string whose first pattern character and string length are the same as the first character and the length of the substring; and, if the substring matches one of the pattern strings, storing information related to the substring in a third storage device.
The system and method of the present application can efficiently find and locate, in a text, a series of predefined pattern strings (for example, Chinese words) written in a language that uses a large character set (charset), such as Chinese. The technique adopted in the present application takes into account the characteristics of languages that use large character sets, and can achieve linear running time and reduced search time. The technique can be used, for example, to block text that contains one or more predefined pattern words in Bulletin Board System (BBS) threads.
Brief description of the drawings
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which like reference numerals indicate similar elements and in which:
Fig. 1 is a block diagram illustrating a system for searching for multiple target pattern strings in a text, according to an exemplary embodiment;
Fig. 2 is a flowchart illustrating a method for searching for multiple target pattern strings in a text, according to an exemplary embodiment; and
Fig. 3 is a block diagram illustrating a machine in the exemplary form of a computer system, according to an exemplary embodiment.
Detailed description of the embodiments
Systems and methods for searching for multiple target pattern strings in a text will now be described. In the following description, numerous specific details are set forth for purposes of explanation, in order to provide a thorough understanding of the invention. It will be apparent to those skilled in the art, however, that the invention may be practiced without these specific details.
In many situations, people search digital documents to find and locate specific content. For example, parents or authorities may search the digital documents that are available to children in order to determine whether those documents contain harmful content written in an Asian language that uses a large character set (for example, Chinese). Such harmful content may be any of multiple predetermined target pattern strings (for example, Chinese characters or words) that are unsuitable for children, such as "pornographic", "pornographic net", "porny", "violence" and "violence TV play". Each target pattern string (for example, "pornographic net", which is three characters long in the original Chinese) has a first character (for example, "色") and a pattern string length (for example, 3).
Fig. 1 is a block diagram illustrating a system 100 for searching for multiple target pattern strings in a text, according to an exemplary embodiment. In some embodiments, system 100 may include a first storage device 10, a second storage device 20, a third storage device 30 and a search engine 40. System 100 may also include one or more processors 50 for executing instructions residing in the computer in order to operate the other components.
In some embodiments, the first storage device 10 may store predefined target pattern strings (for example, "pornographic", "pornographic net", "porny", "violence" and "violence TV play"), each of which has a first character (for example, "色" or "暴") and a pattern string length (for example, 2, 3, 4 or 5). The first storage device 10 may include a HashSet. HashSet is a concrete implementation of the Set interface; it creates a collection that uses a hash table for storage. A hash table stores information using a mechanism called hashing. In some embodiments, system 100 may also include a user interface 60, which can be used to receive the target pattern strings to be stored in the first storage device 10.
In some embodiments, the second storage device 20 may store the unique combinations (for example, <"色", (2, 3, 4)> and <"暴", (2, 5)>) of the first characters (for example, "色" or "暴") and the pattern string lengths (for example, 2, 3, 4 or 5) of the target pattern strings (for example, "pornographic", "pornographic net", "pornographic drama", "porny", "violence" and "violence TV play"). For the pattern strings that start with a given first pattern character (for example, "pornographic", "pornographic net", "pornographic drama" and "porny", which start with "色" and have lengths 2, 3, 3 and 4), the combination of the first pattern character and the pattern lengths contains no repeated length (for example, <"色", (2, 3, 4)>). The second storage device 20 may include at least one HashMap. A HashMap is a data structure that uses a hash function to map identifiers or keys (for example, people's names) efficiently to associated values (for example, their telephone numbers). The hash function converts a key into the index of the array element (slot or bucket) in which the corresponding value is to be found.
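The two stores described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `PatternIndex` and its field names are hypothetical, the ASCII patterns are stand-ins that only reproduce the first-character and length structure of the example (lengths 2, 3, 3, 4 and 2, 5), and indexing by `char` assumes every first character fits in a single UTF-16 code unit.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that builds the two lookup structures. */
class PatternIndex {
    /** All pattern strings (the role of the first storage device). */
    final Set<String> patterns = new HashSet<>();
    /** First character -> distinct pattern lengths (the role of the second storage device). */
    final Map<Character, Set<Integer>> lengthsByFirstChar = new HashMap<>();

    PatternIndex(List<String> patternStrings) {
        for (String p : patternStrings) {
            if (p.isEmpty()) {
                continue; // an empty pattern has no first character
            }
            patterns.add(p);
            // The value is a Set, so a repeated length is stored once:
            // lengths (2, 3, 3, 4) end up as (2, 3, 4).
            lengthsByFirstChar
                .computeIfAbsent(p.charAt(0), c -> new TreeSet<>())
                .add(p.length());
        }
    }
}
```

Building the index from the stand-in patterns "SQ", "SQW", "SQJ", "SQTP", "BL" and "BLDSJ" leaves one map entry per first character, with the lengths of all patterns that share that character collected under it.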
In some embodiments, the search engine 40 may iteratively identify, starting from the beginning of the text, a character that matches one of the first pattern characters (for example, "色"), and set the matching character as the current character in the text.
In some embodiments, the search engine 40 may iteratively extract a substring that starts with the current character (for example, "色") and has a substring length equal to one of the target pattern string lengths (for example, 2).
In some embodiments, the search engine 40 may iteratively compare the substring with each target pattern string (for example, "pornographic") whose first pattern character (for example, "色") and string length (for example, 2) are the same as the first character and the length of the substring. If the extracted substring matches one of the target pattern strings (for example, "pornographic"), information related to the substring can be stored in the third storage device 30. The information related to the substring may include the position of the substring.
This process moves forward until the end of the text is reached. In some embodiments, the located strings are highlighted to alert the user. In some embodiments, system 100 may include a display 70 for displaying the information related to the substrings stored in the third storage device 30.
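As one possible reading of the highlighting step, the sketch below marks every located substring in the text. It is an assumption-laden illustration only: the class name `Highlighter`, the `{start position, length}` hit format and the `[[` and `]]` markers are all hypothetical, and the source does not say how highlighting is rendered.

```java
import java.util.List;

/** Hypothetical helper that marks located substrings in a text. */
class Highlighter {
    /**
     * Wraps every character range covered by a hit in "[[" and "]]".
     * Each hit is {start position, length} with a non-negative start;
     * overlapping or adjacent hits are merged into one marked range.
     */
    static String highlight(String text, List<int[]> hits) {
        boolean[] covered = new boolean[text.length()];
        for (int[] hit : hits) {
            for (int i = hit[0]; i < hit[0] + hit[1] && i < text.length(); i++) {
                covered[i] = true;
            }
        }
        StringBuilder out = new StringBuilder();
        boolean open = false;
        for (int i = 0; i < text.length(); i++) {
            if (covered[i] && !open) {
                out.append("[[");
                open = true;
            }
            if (!covered[i] && open) {
                out.append("]]");
                open = false;
            }
            out.append(text.charAt(i));
        }
        if (open) {
            out.append("]]"); // close a range that runs to the end of the text
        }
        return out.toString();
    }
}
```

Merging matters here because patterns that share a first character can match at the same position (a two-character pattern and a three-character pattern that extends it, for instance), and marking each hit separately would nest the markers.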
System 100 can efficiently find and locate, in the text, substrings that match one or more of the predefined target pattern strings, thereby reducing search time.
Fig. 2 is a flowchart illustrating a method 200 for searching for multiple target pattern strings in a text, according to an exemplary embodiment.
In some embodiments, at operation 202, multiple target pattern strings (for example, "pornographic", "pornographic net", "pornographic drama", "porny", "violence" and "violence TV play") are stored in the first storage device 10. Each of these target pattern strings starts with a first pattern character (for example, "色" or "暴").
At operation 204, the unique combinations (for example, <"色", (2, 3, 4)> and <"暴", (2, 5)>) of the first pattern characters (for example, "色" or "暴") and the corresponding pattern lengths (for example, 2, 3, 4 or 5) of the target pattern strings that start with those characters (for example, "pornographic", "pornographic net", "pornographic drama", "porny", "violence" and "violence TV play") are stored in the second storage device 20.
At operation 206, the search engine 40 is used to iteratively identify, in the text, a character that matches one of the first pattern characters (for example, "色"), which is then set as the current character.
At operation 208, the search engine 40 is used to iteratively extract a substring that starts with the current character (for example, "色") and has a substring length equal to one of the target pattern string lengths (for example, 2).
At operation 210, the search engine 40 is used to iteratively compare the substring with each target pattern string (for example, "pornographic") whose first pattern character (for example, "色") and target string length (for example, 2) are the same as the first character and the length of the substring.
If the substring matches one of the target pattern strings (for example, "pornographic"), then at operation 212 the information related to the substring is stored in the third storage device 30. The information related to the substring may include the position of the substring. Operations 206 to 212 are repeated until the end of the text is reached.
At operation 214, the display 70 is used to display the information related to all the substrings stored in the third storage device 30.
The operation of the embodiment can be performed in two stages: initialization and processing. In the initialization stage, the target pattern words (the words being searched for) are placed in a HashSet, and the first character and the length of each target pattern word are placed in a HashMap. Because a HashMap in Java does not allow duplicate keys, the different lengths of the pattern words that share the same first character are stored together under that character's single entry in the HashMap.
In the processing stage, the text is processed iteratively from its beginning, and each character of the text is examined. If the current character is found in the HashMap, a substring starting at the current position may be a pattern word. The possible lengths are obtained from the HashMap and, for each possible length, the substring of that length is extracted and checked against the HashSet. If there is a hit, a target pattern word has been found in the text; its starting position is the current position and its length is the current length. Otherwise, processing proceeds to the next possible length. Once all possible lengths have been processed, processing proceeds to the next character of the text.
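The two stages can be sketched end to end in Java as shown below. This is a hedged illustration rather than the patent's own code: the class and method names are hypothetical, hits are returned as `{start position, length}` pairs in place of the third storage device, and indexing by `char` assumes characters that fit in one UTF-16 code unit. Unlike the prose, which proceeds to the next possible length after a miss, the sketch tries every stored length at each position, so a shorter hit does not hide a longer pattern that starts at the same place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the two-stage search; all names are illustrative. */
class MultiStringSearch {
    private final Set<String> patterns = new HashSet<>();
    private final Map<Character, Set<Integer>> lengthsByFirstChar = new HashMap<>();

    /** Initialization stage: index every non-empty pattern string. */
    MultiStringSearch(List<String> patternStrings) {
        for (String p : patternStrings) {
            if (p.isEmpty()) {
                continue;
            }
            patterns.add(p);
            lengthsByFirstChar
                .computeIfAbsent(p.charAt(0), c -> new TreeSet<>())
                .add(p.length());
        }
    }

    /** Processing stage: returns each hit as {start position, length}. */
    List<int[]> search(String text) {
        List<int[]> hits = new ArrayList<>();
        for (int pos = 0; pos < text.length(); pos++) {
            Set<Integer> lengths = lengthsByFirstChar.get(text.charAt(pos));
            if (lengths == null) {
                continue; // no pattern starts with this character
            }
            for (int len : lengths) {
                if (pos + len > text.length()) {
                    break; // TreeSet iterates in ascending order, so longer lengths cannot fit either
                }
                if (patterns.contains(text.substring(pos, pos + len))) {
                    hits.add(new int[] {pos, len});
                }
            }
        }
        return hits;
    }
}
```

With the ASCII stand-in patterns "SQ", "SQW", "SQJ", "SQTP", "BL" and "BLDSJ", searching the text "xxSQWyyBLz" reports hits at position 2 with lengths 2 and 3 and at position 7 with length 2. Characters that start no pattern cost a single map lookup, which is where the approach saves time on large character sets.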
Suppose that the text length is M and that there are N target pattern strings. The time required by the initialization stage is A*N, where A is a constant covering the time needed to store a word in the HashSet, to extract the first character of the word, and to store that character and the word length in the HashMap. The time required by the processing stage is B*M, where B is a constant covering the time needed to look up a character in the HashMap, to extract a substring, and to look up that substring in the HashSet when a hit is found. The total time complexity of the algorithm is a function of (A*N + B*M).
Fig. 3 is a block diagram illustrating a machine in the exemplary form of a computer system 300, within which a set of instructions may be executed to cause the machine to perform any one of the methods discussed herein. In alternative embodiments, the machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a network appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions that specifies actions to be taken by that machine. In addition, although only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set of instructions to perform any one or more of the methods discussed herein.
The exemplary computer system 300 includes a processor 302 (for example, a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 304 and a static memory 306, which communicate with one another via a bus 308. The computer system 300 may also include a video display unit 310 (for example, a liquid crystal display (LCD) or a cathode-ray tube (CRT)). The computer system 300 also includes an alphanumeric input device 312 (for example, a keyboard), a cursor control device 314 (for example, a mouse), a disk drive unit 316, a signal generation device 328 (for example, a loudspeaker) and a network interface device 320.
The disk drive unit 316 includes a machine-readable medium 322 on which are stored one or more sets of instructions (for example, software 324) embodying any one or more of the methods or functions described herein. During execution by the computer system 300, the software 324 may also reside, completely or at least partially, within the main memory 304 and/or within the processor 302; the main memory 304 and the processor 302 also constitute machine-readable media.
The software 324 may also be transmitted or received over a network 326 via the network interface device 320. Although the machine-readable medium 322 is shown in the exemplary embodiment as a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (for example, a centralized or distributed database and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methods of the embodiments of the invention. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
Thus, a method and system for searching for multiple target pattern strings in a text have been described. Although the invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded as illustrative rather than restrictive.

Claims (14)

1. A method for searching for strings in a text, the method comprising:
storing pattern strings, each of which starts with a first character;
for each distinct first character, storing a combination of that first character and each distinct string length of the pattern strings that start with that first character;
identifying, in the text, a character that matches one of the first characters of the pattern strings;
iteratively extracting strings from the text, each extracted string starting with the identified character and having a string length equal to one of the string lengths stored together with the matching first character;
iteratively comparing each extracted string with each pattern string that has the same first character and the same string length as the extracted string; and
determining, based on the comparison, that at least one extracted string matches one of the pattern strings.
2. The method of claim 1, further comprising receiving the pattern strings from a user interface.
3. The method of claim 1, further comprising displaying, on a display, information related to the at least one extracted substring.
4. The method of claim 1, wherein the pattern strings are stored as at least one HashSet.
5. The method of claim 4, wherein the combinations are stored as at least one HashMap.
6. The method of claim 3, wherein the information related to the at least one extracted substring comprises the position of the substring in the text.
7. The method of claim 1, wherein the at least one extracted substring is highlighted in the text.
8. A system for searching for strings in a text, the system comprising:
means for storing pattern strings, each of which starts with a first character;
means for storing, for each distinct first character, a combination of that first character and each distinct string length of the pattern strings that start with that first character;
means for identifying, in the text, a character that matches one of the first characters of the pattern strings;
means for iteratively extracting strings from the text, each extracted string starting with the identified character and having a string length equal to one of the string lengths stored together with the matching first character;
means for iteratively comparing each extracted string with each pattern string that has the same first character and the same string length as the extracted string; and
means for determining, based on the comparison, that at least one extracted string matches one of the pattern strings.
9. The system of claim 8, further comprising means for receiving the pattern strings from a user interface.
10. The system of claim 8, further comprising means for displaying, on a display, information related to the at least one extracted substring.
11. The system of claim 9, wherein the information related to the at least one extracted substring comprises the position of the substring in the text.
12. The system of claim 8, wherein the means for storing the pattern strings comprises at least one HashSet.
13. The system of claim 12, wherein the means for storing the combinations comprises at least one HashMap.
14. The system of claim 8, further comprising means for highlighting the at least one extracted substring in the text.
CN201410757944.3A 2010-02-26 2010-02-26 Method and system for searching multiple strings Expired - Fee Related CN104484381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410757944.3A CN104484381B (en) 2010-02-26 2010-02-26 Method and system for searching multiple strings

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010116709.XA CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings
CN201410757944.3A CN104484381B (en) 2010-02-26 2010-02-26 Method and system for searching multiple strings

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201010116709.XA Division CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings

Publications (2)

Publication Number Publication Date
CN104484381A true CN104484381A (en) 2015-04-01
CN104484381B CN104484381B (en) 2018-05-22

Family

ID=52758922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410757944.3A Expired - Fee Related CN104484381B (en) 2010-02-26 2010-02-26 Method and system for searching multiple strings

Country Status (1)

Country Link
CN (1) CN104484381B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1324570A (en) * 2000-05-19 2001-12-05 住友化学工业株式会社 Heating fumigating preventing-killing method for injurious insects
US6484164B1 (en) * 2000-03-29 2002-11-19 Koninklijke Philips Electronics N.V. Data search user interface with ergonomic mechanism for user profile definition and manipulation
US20060212426A1 (en) * 2004-12-21 2006-09-21 Udaya Shakara Efficient CAM-based techniques to perform string searches in packet payloads
CN1890669A (en) * 2003-10-15 2007-01-03 施克莱无线公司 Incremental search of keyword strings
CN100557606C (en) * 2003-03-03 2009-11-04 皇家飞利浦电子股份有限公司 Be used to search the method and apparatus of string

Also Published As

Publication number Publication date
CN104484381B (en) 2018-05-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180522

Termination date: 20210226
