CN106682170A - Application searching method and device - Google Patents

Application searching method and device

Publication number
CN106682170A
Authority
CN
China
Prior art keywords
application
word
label system
language material
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611229802.5A
Other languages
Chinese (zh)
Other versions
CN106682170B (en)
Inventor
庞伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN201611229802.5A
Publication of CN106682170A
Application granted
Publication of CN106682170B
Legal status (current): Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an application searching method and device. The method includes the steps that label systems of applications are built; search words uploaded by a client are received; matching is conducted in the label systems of the applications according to the search words; when the search words are matched with keywords of the label system of one application, relevant information of the application is fed back to the client to be displayed. Therefore, by building the label systems of the applications, when the search words uploaded by the client are received, matching is conducted in the label systems of the applications according to the search words, when the search words are matched with the keywords of the label system of one application, the relevant information of the application is fed back to the client to be displayed, and intelligent searching of the application is achieved. The built label systems of the applications guarantee the recall rate of application engine searching, the searching quality of the application engine searching is improved, and user experience is enhanced.

Description

An application search method and device
Technical field
The present invention relates to the fields of data mining and search, and in particular to an application search method and device.
Background
An application search engine is a search engine service for mobile software applications that provides search and download of applications on mobile phones, for example 360 Mobile Assistant, Tencent MyApp (Yingyongbao), and Quixey. Taking 360 Mobile Assistant as an example, the number of applications runs into the millions. Automatically mining and building a label system for applications is a key technology for improving the search quality of an application search engine, and is also the core technology for realizing function-based search.
Traditionally, application labels are generated in one of two ways. Manual annotation is time-consuming and labor-intensive and yields low coverage. Labels submitted by application developers are often affected by cheating: developers want their own applications to be displayed more often, so they submit large numbers of labels that are unrelated to the application.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide an application search method and device that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, an application search method is provided. The method includes:
Building the label system of each application;
Receiving a search word uploaded by a client;
Matching the search word against the label system of each application;
When the search word matches a keyword in the label system of an application, returning the relevant information of that application to the client for display.
Optionally, building the label system of each application includes:
Obtaining the summary of each application;
Obtaining the search words relating to each application from the application search log;
Mining the label system of each application according to the summary of each application, the search words, and a preset strategy.
Optionally, mining the label system of each application according to the summary of each application, the search words, and the preset strategy includes:
Obtaining a training corpus set according to the summary and search words of each application;
Inputting the training corpus set into an LDA model for training, and obtaining the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model;
Calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
Optionally, obtaining the training corpus set according to the summary and search words of each application includes:
For each application, extracting the first paragraph of text, or the text of a preset number of leading sentences, from the summary of the application; and taking the extracted text together with the search words of the application as the original corpus of the application;
The original corpora of all applications constitute an original corpus set; the original corpus set is preprocessed to obtain the training corpus set.
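The construction of an application's original corpus can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_original_corpus`, the sentence-splitting rule, and the default of two leading sentences are all assumptions made for the example.

```python
import re
from typing import List


def build_original_corpus(summary: str, search_words: List[str],
                          num_sentences: int = 2) -> str:
    """Join the leading sentences of a summary with the app's search words.

    num_sentences stands in for the "preset number" of leading sentences;
    its value here is an assumption.
    """
    # Split after sentence-ending punctuation (Western or Chinese).
    parts = re.split(r"(?<=[.!?。！？])\s*", summary)
    sentences = [p.strip() for p in parts if p.strip()]
    lead = " ".join(sentences[:num_sentences])
    return (lead + " " + " ".join(search_words)).strip()


corpus = build_original_corpus(
    "A chat app. It sends voice. It is free.",
    ["chat", "free messaging"],
)
```

Appending the search words is what lengthens the short summary text before it is preprocessed.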
Optionally, preprocessing the original corpus set includes:
In the original corpus set,
For each original corpus, performing word segmentation on the original corpus to obtain a segmentation result containing multiple terms; finding the phrases formed by adjacent terms in the segmentation result; and retaining the phrases, together with the terms in the segmentation result that are nouns or verbs, as the keywords retained for that original corpus.
Optionally, finding the phrases formed by adjacent terms in the segmentation result includes:
Calculating the cPMId value of every two adjacent terms in the segmentation result; when the cPMId value of two adjacent terms is greater than a first preset threshold, determining that the two adjacent terms form a phrase.
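The text does not give the cPMId formula. The sketch below assumes one published document-level definition of corrected PMI, log( d(x,y) / ( d(x)·d(y)/D + sqrt(d(x))·sqrt(ln δ / −2) ) ), where d(·) are document counts, D is the number of documents, and δ is a significance parameter. The formula choice, the value of `delta`, and the names `cpmid` and `find_phrases` are all assumptions for illustration.

```python
import math
from typing import Dict, List, Set, Tuple


def cpmid(d_xy: int, d_x: int, d_y: int, total_docs: int,
          delta: float = 0.9) -> float:
    """Corrected document-level PMI under the assumed definition."""
    if d_xy == 0:
        return float("-inf")
    expected = d_x * d_y / total_docs
    correction = math.sqrt(d_x) * math.sqrt(math.log(delta) / -2.0)
    return math.log(d_xy / (expected + correction))


def find_phrases(docs: List[List[str]], threshold: float,
                 delta: float = 0.9) -> Set[Tuple[str, str]]:
    """Return adjacent term pairs whose cPMId exceeds the threshold."""
    term_df: Dict[str, int] = {}
    pair_df: Dict[Tuple[str, str], int] = {}
    for tokens in docs:
        for term in set(tokens):
            term_df[term] = term_df.get(term, 0) + 1
        # Count each adjacent pair once per document.
        for pair in set(zip(tokens, tokens[1:])):
            pair_df[pair] = pair_df.get(pair, 0) + 1
    total = len(docs)
    return {
        (x, y) for (x, y), d_xy in pair_df.items()
        if cpmid(d_xy, term_df[x], term_df[y], total, delta) > threshold
    }


docs = [["ride", "hailing", "app"], ["ride", "hailing", "service"],
        ["free", "app"], ["free", "service"]]
phrases = find_phrases(docs, threshold=0.0)
```

In this toy corpus only "ride hailing" co-occurs more often than chance would predict, so it is the only pair merged into a phrase.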
Optionally, preprocessing the original corpus set further includes:
Taking the keywords retained for the original corpus of each application as the first-stage training corpus of the application;
The first-stage training corpora of all applications constitute a first-stage training corpus set; data cleansing is performed on the keywords in the first-stage training corpus set.
Optionally, performing data cleansing on the keywords in the first-stage training corpus set includes:
In the first-stage training corpus set,
For each first-stage training corpus, calculating the TF-IDF value of each keyword in the first-stage training corpus; and deleting the keywords whose TF-IDF value is higher than a second preset threshold and/or lower than a third preset threshold.
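A minimal sketch of this cleansing step follows. The TF-IDF variant (relative term frequency times the natural log of inverse document frequency) and the names `clean_by_tfidf`, `upper`, and `lower` are assumptions; the text only states that keywords above the second threshold or below the third threshold are deleted.

```python
import math
from collections import Counter
from typing import Dict, List


def clean_by_tfidf(corpora: Dict[str, List[str]], upper: float,
                   lower: float) -> Dict[str, List[str]]:
    """Keep only keywords whose TF-IDF lies within [lower, upper]."""
    n_docs = len(corpora)
    # Document frequency: in how many applications each keyword appears.
    doc_freq = Counter(kw for kws in corpora.values() for kw in set(kws))
    cleaned: Dict[str, List[str]] = {}
    for app, kws in corpora.items():
        counts = Counter(kws)
        kept = []
        for kw in kws:
            tf = counts[kw] / len(kws)
            idf = math.log(n_docs / doc_freq[kw])
            if lower <= tf * idf <= upper:
                kept.append(kw)
        cleaned[app] = kept
    return cleaned


corpora = {
    "a": ["app", "taxi", "taxi", "ride"],
    "b": ["app", "food", "delivery", "order"],
}
cleaned = clean_by_tfidf(corpora, upper=0.3, lower=0.1)
```

Here "app" is removed because it appears in every application (TF-IDF of zero), and "taxi" is removed in application "a" because its score exceeds the upper threshold.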
Optionally, preprocessing the original corpus set further includes:
Taking the keywords that remain in the first-stage training corpus of each application after data cleansing as the second-stage training corpus of the application;
For the second-stage training corpus of each application, when a keyword in the second-stage training corpus of the application appears in the title of the application, repeating that keyword in the second-stage training corpus of the application a number of times equal to a fourth preset threshold, to obtain the training corpus of the application;
The training corpora of all applications constitute the training corpus set.
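The title-based repetition can be sketched as below. The function name `boost_title_keywords`, the substring test against the title, and the reading of the fourth preset threshold as "copies per occurrence" are assumptions for illustration.

```python
from typing import List


def boost_title_keywords(keywords: List[str], title: str,
                         repeat: int = 3) -> List[str]:
    """Repeat keywords that occur in the application title.

    repeat stands in for the fourth preset threshold; each occurrence of a
    title keyword is emitted `repeat` times, other keywords once.
    """
    boosted: List[str] = []
    for kw in keywords:
        boosted.extend([kw] * (repeat if kw in title else 1))
    return boosted


boosted = boost_title_keywords(["taxi", "ride", "fast"], "fast taxi", repeat=3)
```

Repeating title keywords raises their term frequency, so a bag-of-words model such as LDA gives them more weight for that application.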
Optionally, calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
Calculating an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result;
According to the application-keyword probability distribution result, for each application, sorting the keywords in descending order of their probability with respect to the application, and selecting the top keywords, up to a fifth preset threshold in number.
Optionally, calculating the application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
For each application, obtaining the probability of each topic with respect to the application from the application-topic probability distribution result;
For each topic, obtaining the probability of each keyword with respect to the topic from the topic-keyword probability distribution result;
Then, for each keyword, taking the product of the keyword's probability with respect to a topic and that topic's probability with respect to an application as the keyword's topic-based probability with respect to the application; and taking the sum of the keyword's topic-based probabilities over all topics as the keyword's probability with respect to the application.
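The calculation described above amounts to p(keyword | application) = Σ over topics of p(keyword | topic) · p(topic | application). A minimal sketch follows; the dictionary layout and the function names are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def app_keyword_probs(app_topic: Dict[int, float],
                      topic_keyword: Dict[int, Dict[str, float]]
                      ) -> Dict[str, float]:
    """Sum p(keyword | topic) * p(topic | application) over all topics."""
    probs: Dict[str, float] = {}
    for topic, p_topic in app_topic.items():
        for kw, p_kw in topic_keyword.get(topic, {}).items():
            probs[kw] = probs.get(kw, 0.0) + p_kw * p_topic
    return probs


def top_keywords(probs: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n keywords with the highest probability, best first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy distributions for one application with two topics.
app_topic = {0: 0.75, 1: 0.25}
topic_keyword = {
    0: {"taxi": 0.5, "ride": 0.5},
    1: {"taxi": 0.25, "map": 0.75},
}
probs = app_keyword_probs(app_topic, topic_keyword)
best = top_keywords(probs, 2)
```

With these toy numbers, "taxi" scores 0.5 × 0.75 + 0.25 × 0.25 = 0.4375 and ranks first.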
Optionally, calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
Taking the top keywords selected for each application, up to the fifth preset threshold in number, as the first-stage label system of the application;
For the first-stage label system of each application, calculating the semantic relation value between each keyword in the first-stage label system of the application and the summary of the application; for each keyword, taking the product of the keyword's semantic relation value and the keyword's probability with respect to the application as the keyword's corrected probability with respect to the application; sorting the keywords in the first-stage label system of the application in descending order of their corrected probability with respect to the application, and selecting the top K keywords to constitute the label system of the application.
Optionally, calculating the semantic relation value between each keyword in the first-stage label system of the application and the summary of the application includes:
Calculating the word vector of the keyword, and calculating the word vector of each term in a preset number of leading sentences of the summary of the application;
Calculating the cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence containing the corresponding term as the semantic relation value between the keyword and that term;
Taking the sum of the semantic relation values between the keyword and all the terms as the semantic relation value between the keyword and the summary of the application.
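The semantic relation value can be sketched as a weighted sum of cosine similarities. The word vectors and sentence weights below are toy values chosen for the example; how the vectors are trained and how sentence weights are assigned is not specified in the text, so both are assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_relation(keyword_vec: Sequence[float],
                      sentences: List[List[Sequence[float]]],
                      sentence_weights: List[float]) -> float:
    """Sum of (cosine similarity * sentence weight) over all summary terms.

    sentences[i] holds the word vectors of the terms in the i-th leading
    sentence of the summary; sentence_weights[i] is that sentence's weight.
    """
    total = 0.0
    for term_vecs, weight in zip(sentences, sentence_weights):
        for term_vec in term_vecs:
            total += weight * cosine(keyword_vec, term_vec)
    return total


relation = semantic_relation(
    [1.0, 0.0],
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]],
    [1.0, 0.5],
)
# Corrected probability = semantic relation value * p(keyword | application).
corrected = relation * 0.25
```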
Optionally, calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
Taking the keywords selected for each application as the second-stage label system of the application;
For the second-stage label system of each application, obtaining from the application search log the set of search words associated with download operations of the application, and counting the DF value of each keyword in the second-stage label system of the application within that search word set; for each keyword, increasing the keyword's probability with respect to the application by a multiple of the DF value to obtain the keyword's second-corrected probability with respect to the application; sorting the keywords in the second-stage label system of the application in descending order of their second-corrected probability with respect to the application, and selecting the top K keywords to constitute the label system of the application.
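The text does not state exactly how the probability is increased by a multiple of the DF value. The sketch below assumes one plausible reading, p × (1 + α · DF), with DF taken as the fraction of download-related search words that contain the keyword; the scaling factor `alpha`, the normalization of DF, and the function name are all assumptions.

```python
from typing import Dict, List


def second_corrected_probs(keyword_probs: Dict[str, float],
                           download_queries: List[str],
                           alpha: float = 1.0) -> Dict[str, float]:
    """Boost each keyword's probability by its DF in download queries."""
    n = len(download_queries)
    corrected: Dict[str, float] = {}
    for kw, p in keyword_probs.items():
        # DF: share of download-related search words containing the keyword.
        df = sum(1 for q in download_queries if kw in q) / n if n else 0.0
        corrected[kw] = p * (1.0 + alpha * df)
    return corrected


corrected = second_corrected_probs(
    {"taxi": 0.25, "map": 0.5},
    ["taxi app", "call taxi", "city map", "news"],
)
```

Keywords that users actually typed before downloading the application are promoted, which ties the label ranking to observed user behavior.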
Optionally, selecting the top K keywords to constitute the label system of the application includes:
Obtaining the quarterly download count of the application from the application search log;
Selecting the top K keywords according to the quarterly download count of the application to constitute the label system of the application, where K is a piecewise linear function of the quarterly download count of the application.
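A piecewise linear choice of K could look like the sketch below. The breakpoints are invented for illustration; the text only states that K is a piecewise linear function of the quarterly download count.

```python
from typing import Sequence, Tuple

# Hypothetical (quarterly downloads, K) breakpoints; not from the source.
BREAKPOINTS: Sequence[Tuple[int, int]] = (
    (0, 3), (1_000, 5), (100_000, 10), (10_000_000, 20),
)


def label_count(quarterly_downloads: int,
                breakpoints: Sequence[Tuple[int, int]] = BREAKPOINTS) -> int:
    """Interpolate K linearly between breakpoints, clamping at both ends."""
    if quarterly_downloads <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (x0, k0), (x1, k1) in zip(breakpoints, breakpoints[1:]):
        if quarterly_downloads <= x1:
            frac = (quarterly_downloads - x0) / (x1 - x0)
            return round(k0 + (k1 - k0) * frac)
    return breakpoints[-1][1]
```

Under these assumed breakpoints, popular applications keep more labels than rarely downloaded ones, and the label count never exceeds the last breakpoint's K.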
According to another aspect of the present invention, an application search device is provided. The device includes:
A label system building unit, adapted to build the label system of each application;
An interaction unit, adapted to receive a search word uploaded by a client;
A search processing unit, adapted to match the search word against the label system of each application;
The interaction unit is further adapted to return the relevant information of an application to the client for display when the search word matches a keyword in the label system of that application.
Optionally, the label system building unit includes:
An information acquisition unit, adapted to obtain the summary of each application, and to obtain the search words relating to each application from the application search log;
An application label mining unit, adapted to mine the label system of each application according to the summary of each application, the search words, and a preset strategy.
Optionally, the application label mining unit is adapted to obtain a training corpus set according to the summary and search words of each application; input the training corpus set into an LDA model for training, and obtain the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model; and calculate the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
Optionally, the application label mining unit is adapted to, for each application, extract the first paragraph of text, or the text of a preset number of leading sentences, from the summary of the application, and take the extracted text together with the search words of the application as the original corpus of the application; the original corpora of all applications constitute an original corpus set; the original corpus set is preprocessed to obtain the training corpus set.
Optionally, the application label mining unit is adapted to, in the original corpus set, for each original corpus, perform word segmentation on the original corpus to obtain a segmentation result containing multiple terms; find the phrases formed by adjacent terms in the segmentation result; and retain the phrases, together with the terms in the segmentation result that are nouns or verbs, as the keywords retained for that original corpus.
Optionally, the application label mining unit is adapted to calculate the cPMId value of every two adjacent terms in the segmentation result, and, when the cPMId value of two adjacent terms is greater than a first preset threshold, determine that the two adjacent terms form a phrase.
Optionally, the application label mining unit is further adapted to take the keywords retained for the original corpus of each application as the first-stage training corpus of the application; the first-stage training corpora of all applications constitute a first-stage training corpus set; data cleansing is performed on the keywords in the first-stage training corpus set.
Optionally, the application label mining unit is adapted to, in the first-stage training corpus set, for each first-stage training corpus, calculate the TF-IDF value of each keyword in the first-stage training corpus, and delete the keywords whose TF-IDF value is higher than a second preset threshold and/or lower than a third preset threshold.
Optionally, the application label mining unit is further adapted to take the keywords that remain in the first-stage training corpus of each application after data cleansing as the second-stage training corpus of the application; for the second-stage training corpus of each application, when a keyword in the second-stage training corpus of the application appears in the title of the application, repeat that keyword in the second-stage training corpus of the application a number of times equal to a fourth preset threshold, to obtain the training corpus of the application; the training corpora of all applications constitute the training corpus set.
Optionally, the application label mining unit is adapted to calculate an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result; and, according to the application-keyword probability distribution result, for each application, sort the keywords in descending order of their probability with respect to the application and select the top keywords, up to a fifth preset threshold in number.
Optionally, the application label mining unit is adapted to, for each application, obtain the probability of each topic with respect to the application from the application-topic probability distribution result; for each topic, obtain the probability of each keyword with respect to the topic from the topic-keyword probability distribution result; then, for each keyword, take the product of the keyword's probability with respect to a topic and that topic's probability with respect to an application as the keyword's topic-based probability with respect to the application; and take the sum of the keyword's topic-based probabilities over all topics as the keyword's probability with respect to the application.
Optionally, the application label mining unit is further adapted to take the top keywords selected for each application, up to the fifth preset threshold in number, as the first-stage label system of the application; for the first-stage label system of each application, calculate the semantic relation value between each keyword in the first-stage label system of the application and the summary of the application; for each keyword, take the product of the keyword's semantic relation value and the keyword's probability with respect to the application as the keyword's corrected probability with respect to the application; sort the keywords in the first-stage label system of the application in descending order of their corrected probability with respect to the application, and select the top K keywords to constitute the label system of the application.
Optionally, the application label mining unit is adapted to calculate the word vector of the keyword and the word vector of each term in a preset number of leading sentences of the summary of the application; calculate the cosine similarity between the word vector of the keyword and the word vector of each term, and take the product of each cosine similarity and the weight of the sentence containing the corresponding term as the semantic relation value between the keyword and that term; and take the sum of the semantic relation values between the keyword and all the terms as the semantic relation value between the keyword and the summary of the application.
Optionally, the application label mining unit is further adapted to take the keywords selected for each application as the second-stage label system of the application; for the second-stage label system of each application, obtain from the application search log the set of search words associated with download operations of the application, and count the DF value of each keyword in the second-stage label system of the application within that search word set; for each keyword, increase the keyword's probability with respect to the application by a multiple of the DF value to obtain the keyword's second-corrected probability with respect to the application; sort the keywords in the second-stage label system of the application in descending order of their second-corrected probability with respect to the application, and select the top K keywords to constitute the label system of the application.
Optionally, the application label mining unit is adapted to obtain the quarterly download count of the application from the application search log, and select the top K keywords according to the quarterly download count of the application to constitute the label system of the application, where K is a piecewise linear function of the quarterly download count of the application.
As can be seen from the above, the technical solution provided by the present invention builds a label system for each application. When a search word uploaded by a client is received, the search word is matched against the label system of each application, and when the search word matches a keyword in the label system of an application, the relevant information of that application is returned to the client for display. This realizes intelligent search for applications. Building the label system of each application guarantees the recall rate of application search, improves its search quality, and enhances the user experience.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of an application search method according to an embodiment of the present invention;
Fig. 2 shows a schematic diagram of an interface for searching based on the application search method according to an embodiment of the present invention;
Fig. 3 shows a flowchart of a method for building the label system of each application according to an embodiment of the present invention;
Fig. 4 shows a schematic diagram of an application search device according to an embodiment of the present invention;
Fig. 5 shows a schematic diagram of a label system building unit according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a flowchart of an application search method according to an embodiment of the present invention. As shown in Fig. 1, the application search method 100 includes:
S110, building the label system of each application.
S120, receiving a search word uploaded by a client.
S130, matching the search word against the label system of each application.
S140, when the search word matches a keyword in the label system of an application, returning the relevant information of that application to the client for display.
As the method shown in Fig. 1 illustrates, this solution builds a label system for each application. When a search word uploaded by a client is received, the search word is matched against the label system of each application, and when the search word matches a keyword in the label system of an application, the relevant information of that application is returned to the client for display. This realizes intelligent search for applications. Building the label system of each application guarantees the recall rate of application search, improves its search quality, and enhances the user experience.
For example, when a user searches for "Didi", the application search engine returns the exact application "Didi Chuxing" and also displays applications with similar functions, such as "Kuaidi Dache" and "Uber China".
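Steps S120 to S140 can be sketched with an in-memory lookup. The label data, the exact-match rule, and the function name `search_applications` are assumptions for illustration; a production engine would typically use an inverted index and ranking rather than a linear scan.

```python
from typing import Dict, List, Set

# Hypothetical label systems: application name -> label keywords.
LABEL_SYSTEMS: Dict[str, Set[str]] = {
    "Didi Chuxing": {"didi", "taxi", "ride hailing"},
    "Kuaidi Dache": {"taxi", "ride hailing"},
    "Meituan Waimai": {"food delivery", "order food"},
}


def search_applications(search_word: str,
                        label_systems: Dict[str, Set[str]]) -> List[str]:
    """Return applications whose label system contains the search word."""
    query = search_word.strip().lower()
    return sorted(app for app, labels in label_systems.items()
                  if query in labels)


results = search_applications("Taxi", LABEL_SYSTEMS)
```

Because matching runs against labels rather than application names, a functional query such as "taxi" returns every application tagged with that function.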
To make the application search method clearer, its implementation is described below through a specific example. Fig. 2 shows a schematic diagram of an interface for searching based on the application search method according to an embodiment of the present invention. In this example, a user searches for the keyword "order food" in 360 Mobile Assistant, and the results displayed by the 360 Mobile Assistant search engine are shown in Fig. 2. As Fig. 2 shows, when the user searches for "order food", the search engine of 360 Mobile Assistant returns all applications with a food-ordering function, such as "Meituan Waimai", "Ele.me", "Baidu Nuomi", "Dianping", and "Meituan". It follows that the application label system built by the present invention plays a major role in retrieval and ranking, markedly improving search quality and the user's search experience.
In the application search method shown in Fig. 1, the process of building the label system of each application in step S110 determines the quality of the search results, so step S110 is described in detail here. Fig. 3 shows a flowchart of a method for building the label system of each application according to an embodiment of the present invention. Referring to Fig. 3, the method includes:
Step S310, obtaining the summary of each application.
Step S320, obtaining the search words relating to each application from the application search log.
Step S330, mining the label system of each application according to the summary of each application, the search words, and a preset strategy.
With the method shown in Fig. 3, the summary of each application is obtained automatically, the search words for each application are obtained in real time from users' historical application search logs, and the application labels are updated dynamically. At the same time, the preset strategy continuously improves the precision and recall of the application labels, so that the label system of each application is mined and built. This solves the problems of traditional application label systems, which rely on manual annotation and therefore suffer from heavy manual workload, low coverage, and serious cheating. It greatly improves the search quality of the application search engine and the user's search experience.
In one embodiment of the present invention, step S330 above, mining the label system of each application according to the summary of each application, the search words, and the preset strategy, includes:
Step S331, obtaining a training corpus set according to the summary and search words of each application.
Step S332, inputting the training corpus set into an LDA model for training, and obtaining the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model.
Step S333, calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
It should be noted that LDA (Latent Dirichlet Allocation) is a kind of document subject matter generation model, It is a kind of non-supervisory machine learning techniques, can be used to recognize extensive document sets (document collection) or language The subject information hidden in material storehouse (corpus).The method that it employs bag of words (bag of words), this method will be each Piece document is considered as a word frequency vector, so as to text message to be converted the digital information for ease of modeling.Because LDA models exist Show preferable in long text, effect is poor on short text, but apply summary very short and small, be a kind of typical short text, be The application effect for making LDA models reaches most preferably, and introducing application, with the interactive history of user (i.e. described search word, hereafter claims For search word) information to being extended using summary, will apply summary short text be extended to the long article for being suitable to LDA models This.Wherein, search word not only can retrieve the lexical item of the application, also including other lexical items, lucky gram of these lexical items comprising engine The problems such as applying the too short synonymous homophone frequency brought of summary short text length too low is taken.
In this embodiment, the GibbsLDA++ implementation of LDA is used. For the mobile application scenario, the GibbsLDA++ source code needs to be modified so that all occurrences of the same term within one application are initialized to the same topic. The original code initializes each term occurrence to a random topic, so a repeated term can be initialized to several different topics. In the mobile application scenario the labels of an application are usually unambiguous, so initializing the same term to the same topic fits the scenario and also improves the results of the LDA model.
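The initialization change described above can be sketched as follows. This is a minimal Python illustration, not the GibbsLDA++ source (which is C++); the function name `init_topics_per_app` and its parameters are hypothetical. It only shows the idea that repeated occurrences of a term inside one application start with the same topic.

```python
import random

def init_topics_per_app(docs, num_topics, seed=0):
    """Give every token an initial topic.

    docs: list of token lists, one list per application (assumed input shape).
    All occurrences of the same term inside one application share one topic;
    the same term in a different application is drawn independently.
    """
    rng = random.Random(seed)
    assignments = []
    for tokens in docs:
        topic_of_term = {}  # term -> initial topic, local to this application
        doc_topics = []
        for term in tokens:
            if term not in topic_of_term:
                topic_of_term[term] = rng.randrange(num_topics)
            doc_topics.append(topic_of_term[term])
        assignments.append(doc_topics)
    return assignments
```

The stock behaviour would instead draw `rng.randrange(num_topics)` for every token, which is what lets a repeated term start in several topics.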
To make the solution clearer, the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model in step S332 are illustrated in detail here. For example, LDA training is run with 120 topics for 300 iterations and generates two files. The first file is the topic-keyword probability distribution result; as shown in Table 1, it gives the probabilities between the 4th topic and each of 22 keywords:
Table 1
The second file is the application-topic probability distribution result; as shown in Table 2, it gives the probabilities between the application with ID 5427 and each of 6 topics (topic IDs 134, 189, 139, 126, 14, and 18).
Table 2
To make the solution clearer still, a specific example follows. Suppose the summary of "WeChat" includes "WeChat is a free application launched by Tencent on January 21, 2011 that provides instant messaging services for smart terminals. WeChat supports quickly sending free (a small amount of network traffic is consumed) voice messages, videos, pictures, and text over the network, across carriers and operating system platforms", and the search words of WeChat include "WeChat, free instant messaging, Tencent, Moments, official accounts, message push, Shake, People Nearby, add friends by scanning a QR code, group calls".
The training corpus set then includes the full text of the "WeChat" summary above and all of the "WeChat" search words. This training corpus set is input into the LDA model for training. Suppose the topics the LDA model generates for the "WeChat" training corpus include "social", and the generated keywords include chat, voice, phone, phone book, social, making friends, communication, address book, and friend. The application-topic probability distribution result output by the LDA model then includes P1.1 (WeChat-social), and the topic-keyword probability distribution result includes P2.1 (WeChat-chat), P2.2 (WeChat-voice), P2.3 (WeChat-phone), P2.4 (WeChat-phone book), P2.5 (WeChat-social), P2.6 (WeChat-making friends), P2.7 (WeChat-communication), P2.8 (WeChat-address book), and P2.9 (WeChat-friend). From P1.1 and P2.1 through P2.9, the label system of WeChat is calculated, as shown in Table 3.
Table 3
It follows that according to the summary and search word of each application, corpus set is obtained, then by LDA models pair The corpus set of acquisition is processed, and generates corresponding application-theme probability distribution result and theme-key words probabilities Distribution results, and then according to the application-theme probability distribution result and the theme-key words probabilities distribution results, calculate The label system of each application is obtained, is realized and more comprehensively and accurately representing for text is described to application content or function.
Because in the actual popularization of existing application, the label of application is directly submitted to by developer, in submission application label During, the developer of application retouches to allow the application of oneself to obtain the installation of numerous clients and use in the label of application Have submitted in stating it is substantial amounts of cause deceptive information label phenomenon long-term existence with using unrelated content, had a strong impact on application The search quality of search engine, greatly reduces user's search experience, in order to solve this problem, in an enforcement of the present invention In example, according to the summary and search word of each application, obtain corpus set includes above-mentioned steps S331:For each application, The word of first section word or front predetermined number sentence is extracted from the summary of the application;By the word for extracting and the application Original language material of the search word collectively as the application;The original language material of each application constitutes original language material set;To the original language Material set carries out pretreatment, obtains corpus set.
For example, for the application "WeChat", the obtained summary of "WeChat" includes:
"WeChat is a social software.
WeChat provides functions such as the official account platform, Moments, and message push. Users can add friends and follow official accounts through "Shake", "Search Number", "People Nearby", and scanning a QR code, and can share content with friends and share interesting content they see to Moments. WeChat supports quickly sending free (a small amount of network traffic is consumed) voice messages, videos, pictures, and text over the network across carriers and operating system platforms, and can also use data shared through streaming media content and location-based social plug-ins such as "Shake", "Drift Bottle", "Moments", "Official Account Platform", and "Voice Notepad".
As of the first quarter of 2015, WeChat covered more than 90% of smartphones in China, monthly active users reached 549 million, and users covered more than 200 countries and more than 20 languages. In addition, the total number of WeChat official accounts of various brands exceeded 8 million, the number of connected mobile applications exceeded 85,000, and WeChat Pay users reached about 400 million."
The first sentence, "WeChat is a social software", is extracted from the summary of "WeChat" above, and the search words of "WeChat" are obtained, including "chat, voice, phone, phone book, social, making friends, communication, address book, friend". "WeChat is a social software" and "chat, voice, phone, phone book, social, making friends, communication, address book, friend" together form the original corpus of "WeChat". The original corpora of the other applications are obtained in the same way, and the original corpora of all applications form the original corpus set. The original corpus set is preprocessed to obtain the training corpus set.
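As a rough illustration of this step, the sketch below takes the first sentences of a summary and appends the search words. The sentence-splitting rule and the names `first_sentences` and `build_original_corpus` are assumptions made for the example; the source does not specify how sentences are delimited.

```python
import re

# Assumed rule: split after Chinese sentence-ending punctuation, after "!" or "?",
# or after a full stop that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[。！？!?])|(?<=\.)\s+")

def first_sentences(summary, n=1):
    """Return the first n sentences of an application summary."""
    parts = [p.strip() for p in _SENTENCE_END.split(summary)]
    return [p for p in parts if p][:n]

def build_original_corpus(summary, search_words, n_sentences=1):
    """Original corpus of one application: leading summary sentences plus search words."""
    return first_sentences(summary, n_sentences) + list(search_words)
```

Calling `build_original_corpus` once per application and collecting the results gives the original corpus set that is later preprocessed.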
Specifically, preprocessing the original corpus set includes: for each original corpus in the original corpus set, performing word segmentation on the original corpus to obtain a segmentation result containing multiple terms; finding phrases composed of adjacent terms in the segmentation result; and retaining the phrases, the terms in the segmentation result that are nouns, and the terms that are verbs, as the keywords retained for that original corpus.
For example, in the original corpus set, the original corpus of "WeChat" is "WeChat is a social software, chat, voice, make a phone call, phone book, social, making friends, communication, address book, friend". Word segmentation of this original corpus gives a segmentation result containing multiple terms: "WeChat, is, a, social, software, chat, voice, make a phone call, phone book, social, making friends, communication, address book, friend". Finding the phrases composed of adjacent terms in the segmentation result gives "WeChat, a, social, software, chat, voice, make a phone call, phone book, social, making friends, communication, address book, friend". Retaining the phrases, the nouns, and the verbs as the keywords for the original corpus, the keywords of "WeChat" include "WeChat, social, chat, voice, make a phone call, phone book, social, making friends, communication, address book, friend".
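The retention rule above can be sketched as a small filter. The sketch assumes the segmenter has already produced (term, part-of-speech tag) pairs and that noun tags start with "n" and verb tags start with "v", as in common Chinese tag sets; the function name `retain_keywords` and the tag convention are assumptions, not something the source specifies.

```python
def retain_keywords(tagged_terms, phrases=()):
    """Keep detected phrases plus the terms tagged as nouns or verbs.

    tagged_terms: iterable of (term, pos_tag) pairs from a word segmenter.
    phrases: phrases already detected from adjacent terms.
    """
    keep_prefixes = ("n", "v")  # assumed tag convention: n* = noun, v* = verb
    kept = [term for term, tag in tagged_terms if tag.startswith(keep_prefixes)]
    return list(phrases) + kept
```

For instance, `retain_keywords([("map", "n"), ("navigate", "v"), ("very", "d")], phrases=["bus transfer"])` keeps the phrase, the noun, and the verb, and drops the adverb.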
To decide whether two terms form a phrase, the closeness of the preceding and following terms is calculated. In one embodiment of the invention, finding phrases composed of adjacent terms in the segmentation result includes: calculating the cPMId value of every two adjacent terms in the segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a first preset threshold.
For example, the first preset threshold is set to 5 and the segmentation result of "Baidu Map" is "save, data traffic, bus, transfer". The cPMId values of "save, data traffic", "data traffic, bus", and "bus, transfer" are calculated. If the cPMId values of "save, data traffic" and "bus, transfer" are greater than 5, these pairs are determined to form the phrases "save data traffic" and "bus transfer". If the cPMId value of "data traffic, bus" is less than 5, "data traffic, bus" is determined not to form a phrase.
It should be noted that cPMId is calculated as shown in Formula 1.
In Formula 1, δ = 0.7, d(x, y) is the co-occurrence frequency of the two terms x and y, d(x) is the occurrence frequency of term x, d(y) is the occurrence frequency of term y, and D is the total number of applications.
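Formula 1 itself is not reproduced in this text. The sketch below uses the definition of cPMId found in the published literature on significance-corrected pointwise mutual information, cPMId(x, y) = log( d(x, y) / ( d(x)·d(y)/D + sqrt(d(x))·sqrt(ln δ / (−2)) ) ), which uses the same symbols as the definitions above. That this is the exact form of Formula 1, and that the logarithm is natural, are assumptions; the helper `merge_phrases` and its greedy left-to-right merging are likewise illustrative only.

```python
import math

def cpmid(d_xy, d_x, d_y, total_apps, delta=0.7):
    """cPMId of an adjacent term pair (assumed form of Formula 1, natural log)."""
    if d_xy == 0:
        return float("-inf")
    expected = d_x * d_y / total_apps                      # co-occurrence expected by chance
    slack = math.sqrt(d_x) * math.sqrt(math.log(delta) / -2.0)
    return math.log(d_xy / (expected + slack))

def merge_phrases(terms, pair_counts, term_counts, total_apps, threshold=5.0):
    """Greedily join adjacent terms whose cPMId exceeds the first preset threshold."""
    merged, i = [], 0
    while i < len(terms):
        if i + 1 < len(terms):
            x, y = terms[i], terms[i + 1]
            score = cpmid(pair_counts.get((x, y), 0), term_counts[x], term_counts[y], total_apps)
            if score > threshold:
                merged.append(x + " " + y)
                i += 2
                continue
        merged.append(terms[i])
        i += 1
    return merged
```

With this form, a pair scores high only when it co-occurs far more often than chance and the terms are frequent enough for the count to be statistically meaningful, which is why rare accidental neighbours are not merged.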
Further, in one embodiment of the invention, preprocessing the original corpus set also includes: taking the keywords retained for the original corpus of each application as the first-stage training corpus of the application; forming a first-stage training corpus set from the first-stage training corpora of all applications; and performing data cleaning on the keywords in the first-stage training corpus set.
Specifically, among applications numbering in the millions, a term that occurs with very high frequency is unlikely to be a label, and a term that occurs with very low frequency is likewise unlikely to be a label, so the data cleaning process filters out keywords that occur with very high frequency and keywords that occur with very low frequency.
For example, the keywords retained for the original corpus of "WeChat" include "WeChat, social, chat, voice, make a phone call, phone book, social, making friends, communication, address book, friend", so these keywords are taken as the first-stage training corpus of "WeChat". The first-stage training corpora of all applications form the first-stage training corpus set, and data cleaning is performed on the keywords in that set to filter out the very-high-frequency and very-low-frequency terms, which improves the quality of the application search engine.
To filter out the keywords that occur with very high frequency and those that occur with very low frequency in the first-stage training corpus set, in one embodiment of the invention, performing data cleaning on the keywords in the first-stage training corpus set includes: for each first-stage training corpus in the set, calculating the TF-IDF value of each keyword in that first-stage training corpus; and deleting the keywords whose TF-IDF value is higher than a second preset threshold and/or lower than a third preset threshold.
In the above process, the TF-IDF value of each keyword in the first-stage training corpus is calculated with the TF-IDF formula, which cleans the data further.
For example, the first-stage training corpus of "WeChat" includes "WeChat, social, chat, voice, make a phone call, phone book, social, making friends, communication, address book, friend". The TF-IDF formula is used to calculate the TF-IDF value of each term and phrase in this first-stage training corpus, giving TF-IDF(WeChat), TF-IDF(social), TF-IDF(chat), TF-IDF(voice), TF-IDF(make a phone call), TF-IDF(phone book), TF-IDF(social), TF-IDF(making friends), TF-IDF(communication), TF-IDF(address book), and TF-IDF(friend). If TF-IDF(communication), TF-IDF(address book), and TF-IDF(friend) are higher than the second preset threshold and/or lower than the third preset threshold, "communication, address book, friend" are deleted. It should be noted that the second and third preset thresholds depend on the specific corpus, so specific values are not listed here. TF-IDF is used for data cleaning because it assesses well how important a word is to one document in a document set or corpus, which fully meets the data cleaning needs of the invention.
TF-IDF is calculated as shown in Formula 2:
In Formula 2, count(w, app) is the frequency of term w in the app, count(w, Corpus) is the frequency of w in the corpus, nCorpus is the total number of apps, and app_count(w) is the number of apps that contain term w.
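A minimal sketch of the cleaning step follows. Formula 2 is not reproduced in this text, so the way the four quantities are combined here, TF-IDF(w, app) = count(w, app) / count(w, Corpus) × log(nCorpus / app_count(w)), is an assumption chosen only because it uses exactly the quantities defined above. The function names and the threshold arguments `high` and `low` (standing in for the second and third preset thresholds) are also hypothetical.

```python
import math
from collections import Counter

def tfidf(count_in_app, count_in_corpus, n_corpus, app_count):
    """Assumed combination of the quantities defined for Formula 2."""
    return count_in_app / count_in_corpus * math.log(n_corpus / app_count)

def clean_first_stage(corpora, high, low):
    """Drop keywords whose TF-IDF is above `high` or below `low`.

    corpora: dict mapping application name -> list of first-stage keywords.
    """
    corpus_counts = Counter()   # count(w, Corpus)
    app_counts = Counter()      # app_count(w)
    for keywords in corpora.values():
        corpus_counts.update(keywords)
        app_counts.update(set(keywords))
    n_corpus = len(corpora)     # nCorpus
    cleaned = {}
    for app, keywords in corpora.items():
        in_app = Counter(keywords)  # count(w, app)
        cleaned[app] = [
            w for w in keywords
            if low <= tfidf(in_app[w], corpus_counts[w], n_corpus, app_counts[w]) <= high
        ]
    return cleaned
```

Under this assumed form, a keyword that appears in every application gets a score of zero and falls below the lower threshold, which matches the stated goal of removing very common terms.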
Further, in one embodiment of the invention, preprocessing the original corpus set also includes: taking the keywords that remain in the first-stage training corpus of each application after data cleaning as the second-stage training corpus of the application; for the second-stage training corpus of each application, when a keyword in it appears in the name of the application, repeating that keyword in the second-stage training corpus a fourth-preset-threshold number of times to obtain the training corpus of the application; and forming the training corpus set from the training corpora of all applications.
For example, the first-stage training corpus of "WeChat" includes "WeChat, social, chat, voice, make a phone call, phone book, social, making friends, communication, address book, friend". Data cleaning removes "communication, address book, friend", so the remaining keywords, "WeChat, social, chat, voice, make a phone call, phone book, social, making friends", form the second-stage training corpus of "WeChat".
Analysis of the second-stage corpus shows that labels expressing an application's function or category often appear in its name, such as "taxi hailing" in "Didi Taxi Hailing", "takeaway" in "Koubei Takeaway", "car rental" in "Aotu Car Rental", and "map" in "Baidu Map". To highlight such important labels, in the corpus of each application, the terms that appear in the application name are repeated three times, and phrases whose cPMId value is higher than 10.0 are likewise repeated three times, which raises the frequency of these potentially important phrase labels. At this point the training corpus set for the LDA topic model is complete, and it is stored in the file app_corpus_seg_nouns_verb_phrase_filtered_repeat.txt.
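The repetition step can be sketched as below. The function name `boost_keywords`, the case-insensitive substring test against the application name, and the reading of "repeated three times" as "appears three times in total" are assumptions for illustration.

```python
def boost_keywords(app_name, keywords, strong_phrases=(), repeat=3):
    """Repeat keywords found in the app name, and strongly bound phrases.

    strong_phrases: phrases whose cPMId value exceeded the higher cutoff (10.0 in the text).
    repeat: total number of occurrences after boosting (assumed meaning of the
    fourth preset threshold).
    """
    name = app_name.lower()
    strong = set(strong_phrases)
    boosted = []
    for kw in keywords:
        times = repeat if (kw.lower() in name or kw in strong) else 1
        boosted.extend([kw] * times)
    return boosted
```

Because LDA works on term counts, repeating a keyword in the training corpus raises its weight for that application without changing the model itself.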
In one embodiment of the invention, step S333 above, calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result, includes:
calculating an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result; and, according to the application-keyword probability distribution result, for each application, sorting the keywords by their probability with respect to the application in descending order and selecting the first fifth-preset-threshold number of keywords.
For example, the fifth preset threshold is set to 8. The LDA model outputs the topic probability distribution under each application and the term probability distribution under each topic. To obtain the labels of each application, the topic probability distribution and the keyword probability distribution are each sorted by probability in descending order, the top 50 topics under each application and the top 120 keywords under each topic are selected, and the keyword probabilities are weighted by the topic probabilities. Each keyword of an application then has a weight that represents its importance under the application. The keywords are sorted by this label weight in descending order and the top 8 are selected, which gives the label list generated by LDA. This list contains a lot of noise and the label order is not accurate, as shown in Table 4.
Table 4
Here, calculating the application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
for each application, obtaining the probability of each topic with respect to the application from the application-topic probability distribution result; for each topic, obtaining the probability of each keyword with respect to the topic from the topic-keyword probability distribution result; then, for each keyword, taking the product of the keyword's probability with respect to a topic and that topic's probability with respect to an application as the keyword's topic-based probability with respect to the application; and taking the sum of the keyword's topic-based probabilities with respect to the application over all topics as the keyword's probability with respect to the application.
For example, a keyword of application C is A, and the topics corresponding to keyword A include B1, B2, and B3. The probability of keyword A with respect to topic B1 is P(A_B1), and the probability of topic B1 with respect to application C is P(B1_C), so P(A_B1) * P(B1_C) is the probability of keyword A with respect to application C based on topic B1. Likewise, P(A_B2) * P(B2_C) is the probability of keyword A with respect to application C based on topic B2, and P(A_B3) * P(B3_C) is the probability of keyword A with respect to application C based on topic B3. The probability of keyword A with respect to application C is then P(A_C) = P(A_B1) * P(B1_C) + P(A_B2) * P(B2_C) + P(A_B3) * P(B3_C).
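The sum-of-products calculation above maps directly onto a few lines of code. The sketch assumes the two LDA outputs have already been loaded into dictionaries; the function names and the dictionary layout are illustrative choices, not part of the source.

```python
from collections import defaultdict

def app_keyword_probs(app_topic, topic_keyword):
    """P(keyword | app) = sum over topics of P(keyword | topic) * P(topic | app).

    app_topic: dict topic -> probability of the topic for one application.
    topic_keyword: dict topic -> dict keyword -> probability of the keyword in the topic.
    """
    scores = defaultdict(float)
    for topic, p_topic in app_topic.items():
        for keyword, p_keyword in topic_keyword.get(topic, {}).items():
            scores[keyword] += p_keyword * p_topic
    return dict(scores)

def top_keywords(scores, n):
    """Keywords sorted by probability in descending order, truncated to n."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _ in ranked[:n]]
```

With P(B1_C) = 0.5, P(B2_C) = 0.3, P(B3_C) = 0.2 and P(A_B1) = 0.4, P(A_B2) = 0.2, P(A_B3) = 0.1, the function returns P(A_C) = 0.2 + 0.06 + 0.02 = 0.28.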
On this basis, further, in one embodiment of the invention, calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result also includes:
taking the first fifth-preset-threshold number of keywords selected for each application as the first-stage label system of the application; for the first-stage label system of each application, calculating the semantic relation value between each keyword in the first-stage label system and the summary of the application; for each keyword, taking the product of the keyword's semantic relation value and the keyword's probability with respect to the application as the keyword's corrected probability with respect to the application; and sorting the keywords in the first-stage label system by corrected probability in descending order and selecting the first K keywords to form the label system of the application.
For example, suppose the fifth preset threshold is 3 and the keywords selected for "Baidu Map" are "map, search, navigation"; then "map, search, navigation" is taken as the first-stage label system of "Baidu Map".
For the first-stage label system of "Baidu Map", the semantic relation values between each keyword in "map, search, navigation" and the summary of "Baidu Map" are calculated as R1, R2, and R3, and the probabilities of each keyword with respect to "Baidu Map" are P1, P2, and P3. R1*P1, R2*P2, and R3*P3 are then the corrected probabilities with respect to "Baidu Map". If R1*P1 > R3*P3 > R2*P2, the order of the keywords in the first-stage label system of "Baidu Map" is "map, navigation, search", and if 2 keywords are chosen to form the label system, the label system of "Baidu Map" includes "map, navigation".
Specifically, calculating the semantic relation value between each keyword in the first-stage label system of the application and the summary of the application includes:
calculating the word vector of the keyword, and calculating the word vector of each term in the first preset number of sentences of the application's summary; calculating the cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence containing the corresponding term as the semantic relation value between the keyword and that term; and taking the sum of the semantic relation values between the keyword and all the terms as the semantic relation value between the keyword and the summary of the application.
For example, the search word set is first obtained from the search log of the application search engine and used as input data for training word vectors, and training produces a 300-dimensional word vector dictionary file, tag_query_w2v_300.dict. Suppose the keywords of "Baidu Map" include "map, search, navigation", and the word vector of "map" is M1. The word vectors of the terms in the first 3 sentences of the "Baidu Map" summary are N1, N2, and N3. The cosine similarities between the word vector of "map" and the word vectors of these terms are cos(M1, N1), cos(M1, N2), and cos(M1, N3). The weights of the sentences containing the corresponding terms are Q1, Q2, and Q3, so the semantic relation values between the keyword and the corresponding terms are Q1*cos(M1, N1), Q2*cos(M1, N2), and Q3*cos(M1, N3), and Q1*cos(M1, N1) + Q2*cos(M1, N2) + Q3*cos(M1, N3) is the semantic relation value between "map" and the summary of "Baidu Map".
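The semantic relation value and the first correction can be sketched together. The sketch assumes word vectors are plain lists of floats already looked up from the trained dictionary, and that each summary sentence is passed in with its weight; the function names and this input layout are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_relation(keyword_vec, weighted_sentences):
    """Sum of sentence_weight * cos(keyword, term) over all terms of the leading sentences.

    weighted_sentences: list of (sentence_weight, [term_vector, ...]) pairs.
    """
    return sum(
        weight * cosine(keyword_vec, term_vec)
        for weight, term_vecs in weighted_sentences
        for term_vec in term_vecs
    )

def first_correction(probs, relations, k):
    """Rank keywords by relation * probability and keep the first k."""
    corrected = {kw: relations[kw] * p for kw, p in probs.items()}
    return sorted(corrected, key=corrected.get, reverse=True)[:k]
```

In the "Baidu Map" example, if the corrected values satisfy R1*P1 > R3*P3 > R2*P2, `first_correction` with `k=2` returns "map" followed by "navigation".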
Further, in one embodiment of the invention, calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result also includes:
taking the keywords selected for each application as the second-stage label system of the application; for the second-stage label system of each application, obtaining from the application search log the set of search words associated with download operations on the application, and counting the DF value of each keyword in the second-stage label system within that search word set; for each keyword, increasing the keyword's probability with respect to the application by a multiple equal to the DF value to obtain the keyword's secondary corrected probability with respect to the application; and sorting the keywords in the second-stage label system by secondary corrected probability in descending order and selecting the first K keywords to form the label system of the application.
For example, the keywords of "Baidu Map" are "map, search, navigation", and the set of historical search words that led to downloads of "Baidu Map" is mined. The DF value of the keyword "map" in the historical search word set of "Baidu Map" is DF1, that of "search" is DF2, and that of "navigation" is DF3. The probabilities of "map", "search", and "navigation" with respect to "Baidu Map" are P1, P2, and P3. The secondary corrected probability of "map" with respect to "Baidu Map" is then P1 * (1 + DF1), that of "search" is P2 * (1 + DF2), and that of "navigation" is P3 * (1 + DF3).
If P1 * (1 + DF1) > P3 * (1 + DF3) > P2 * (1 + DF2), the order of the keywords of "Baidu Map" is adjusted to "map, navigation, search", and if the first two keywords are chosen to form the label system of "Baidu Map", the label system includes "map, navigation". After this adjustment the accuracy of the label order of "Baidu Map" improves substantially. The result of the first correction for "Koubei Takeaway" and "Baidu Map" is shown in Table 5:
Table 5
The result of the secondary correction for "Koubei Takeaway" and "Baidu Map" is shown in Table 6:
Table 6
Comparing Table 5 and Table 6 shows that after the secondary correction the accuracy of the label order of the applications improves substantially.
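The secondary correction, P * (1 + DF), can be sketched as follows. How the DF value is counted is not spelled out in the text; the sketch assumes it is the number of download-related search words that contain the keyword. The function names are hypothetical.

```python
def document_frequency(keyword, search_words):
    """Assumed DF: number of download-related search words that contain the keyword."""
    return sum(1 for query in search_words if keyword in query)

def secondary_correction(probs, search_words, k):
    """Rank keywords by P * (1 + DF) and keep the first k.

    probs: dict keyword -> probability of the keyword for the application.
    search_words: search words taken from log entries that ended in a download.
    """
    corrected = {
        kw: p * (1 + document_frequency(kw, search_words))
        for kw, p in probs.items()
    }
    ranked = sorted(corrected, key=corrected.get, reverse=True)
    return ranked[:k], corrected
```

A keyword that users actually typed before downloading the application is pushed up the list, while a keyword that never appears in those searches keeps its original probability.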
In a specific example, selecting the first K keywords to form the label system of the application includes:
obtaining the quarterly download count of the application from the application search log;
selecting the first K keywords to form the label system of the application according to the quarterly download count of the application, where the value of K is a piecewise-linear function of the application's quarterly download count.
In practice, the accuracy@k of an application's label list was found to be related to whether the application is popular, and the quarterly download count reflects exactly that. Each application retains between 3 and 15 labels, with the number increasing with the quarterly download count; accuracy is 92% and recall is 76%. Examples are shown in Table 7.
Table 7
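One way to realize K as a piecewise-linear function of the quarterly download count is sketched below. The breakpoints are invented for illustration; the source only states that each application keeps between 3 and 15 labels and that the number grows with downloads.

```python
# Hypothetical breakpoints: (quarterly downloads, number of labels kept).
DEFAULT_BREAKPOINTS = ((0, 3), (1_000, 5), (100_000, 10), (10_000_000, 15))

def label_count(quarterly_downloads, breakpoints=DEFAULT_BREAKPOINTS):
    """Interpolate K linearly between breakpoints and clamp at both ends."""
    points = sorted(breakpoints)
    if quarterly_downloads <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if quarterly_downloads <= x1:
            fraction = (quarterly_downloads - x0) / (x1 - x0)
            return round(y0 + fraction * (y1 - y0))
    return points[-1][1]
```

Given a ranked keyword list, `ranked[:label_count(downloads)]` then yields the label system for that application.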
Fig. 4 shows a schematic diagram of an application search device according to an embodiment of the invention. As shown in Fig. 4, the application search device 400 includes:
Label system construction unit 410, adapted to build the label system of each application.
Interaction unit 420, adapted to receive a search word uploaded by a client.
Search processing unit 430, adapted to match the search word against the label system of each application.
The interaction unit 420 is further adapted to return the relevant information of an application to the client for display when the search word matches a keyword in the label system of that application.
As the device shown in Fig. 4 illustrates, this solution builds the label system of each application; when a search word uploaded by a client is received, it is matched against the label system of each application, and when the search word matches a keyword in the label system of an application, the relevant information of that application is returned to the client for display. This achieves intelligent application search. Building a label system for each application ensures the recall of application search, improves its search quality, and enhances the user experience.
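The matching flow of the device can be sketched in a few lines. Exact string equality between the search word and a label is an assumption made for simplicity (the text only says the search word "matches" a keyword), and the function name and data layout are hypothetical.

```python
def search_apps(search_word, label_systems, app_info):
    """Return the info of every application whose label system contains the search word.

    label_systems: dict application name -> list of labels (built by unit 410).
    app_info: dict application name -> information returned to the client.
    """
    results = []
    for app, labels in label_systems.items():
        if search_word in labels:  # assumed matching rule: exact label match
            results.append(app_info[app])
    return results
```

For example, with label systems `{"Baidu Map": ["map", "navigation"], "WeChat": ["chat", "social"]}`, the search word "navigation" returns only the information for "Baidu Map".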
Fig. 5 shows a schematic diagram of a label system construction unit according to an embodiment of the invention. As shown in Fig. 5, the label system construction unit 500 includes:
Information acquisition unit 510, adapted to obtain the summary of each application, and to obtain the search words about each application from the application search log.
Application label mining unit 520, adapted to mine the label system of each application according to the summary of each application, the search words, and a preset strategy.
It should be noted that the label system construction unit 500 has the same function as the label system construction unit 410 in Fig. 4.
In one embodiment of the invention, unit 520 is excavated using label, is suitable to the summary according to each application and search Word, obtains corpus set;Corpus set is input into into LDA models and is trained, obtain answering for LDA models output With-theme probability distribution result and theme-key words probabilities distribution results;According to the application-theme probability distribution result With the theme-key words probabilities distribution results, the label system of each application is calculated.
Wherein, the application label excavates unit 520, is suitable to, for each is applied, extract first from the summary of the application The word of section word or front predetermined number sentence;By the word for extracting and the search word of the application collectively as the application Original language material;The original language material of each application constitutes original language material set;Pretreatment is carried out to the original language material set, is instructed Practice language material set.
In one embodiment, the application label excavates unit 520 and the process of pretreatment is carried out to original language material set Can be:In the original language material set, for each original language material, word segmentation processing is carried out to the original language material, obtained Word segmentation result comprising multiple lexical items;The phrase that adjacent lexical item in searching by the word segmentation result is constituted;Retain the phrase, The lexical item for belonging to noun in the word segmentation result and the lexical item for belonging to verb, as the key word that the original language material correspondence retains.
Specifically, the application label mining unit 520 is adapted to calculate the cPMId value of every two adjacent terms in the segmentation result and, when the cPMId value of two adjacent terms is greater than a first preset threshold, determine that the two adjacent terms form a phrase.
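The phrase test can be sketched as below. This section does not state the cPMId formula, so the sketch assumes the document-count form commonly attributed to Damani's work on significance-corrected PMI, with delta = 0.9 as a hypothetical default. The greedy, non-overlapping merge and the space used to join terms are likewise assumptions.

```python
import math

def cpmid(d_xy: int, d_x: int, d_y: int, n_docs: int, delta: float = 0.9) -> float:
    """Assumed cPMId: log( d(x,y) / ( d(x)*d(y)/D + sqrt(d(x)) * sqrt(ln(delta) / -2) ) ).

    d_xy: documents in which x and y occur adjacently; d_x, d_y: documents
    containing each term; n_docs: total number of documents (D).
    """
    if d_xy == 0:
        return float("-inf")
    expected = d_x * d_y / n_docs                              # co-occurrence expected by chance
    slack = math.sqrt(d_x) * math.sqrt(math.log(delta) / -2.0)  # significance margin
    return math.log(d_xy / (expected + slack))

def merge_phrases(terms, score, threshold):
    """Greedily join adjacent terms whose score exceeds the threshold (assumed policy)."""
    out, i = [], 0
    while i < len(terms):
        if i + 1 < len(terms) and score(terms[i], terms[i + 1]) > threshold:
            out.append(terms[i] + " " + terms[i + 1])
            i += 2
        else:
            out.append(terms[i])
            i += 1
    return out
```

Here `score` would wrap `cpmid` with counts gathered from the original corpus set, and `threshold` plays the role of the first preset threshold.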
Further, in another embodiment, the application label mining unit 520 is further adapted to take the keywords retained for the original corpus of each application as the first-stage training corpus of that application; the first-stage training corpora of all applications constitute a first-stage training corpus set; and perform data cleaning on the keywords in the first-stage training corpus set.
Specifically, the application label mining unit 520 is adapted to, for each first-stage training corpus in the first-stage training corpus set, calculate the TF-IDF value of each keyword in that first-stage training corpus, and delete the keywords whose TF-IDF value is higher than a second preset threshold and/or lower than a third preset threshold.
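A minimal sketch of this cleaning step follows. It assumes one common TF-IDF variant (relative term frequency times the natural log of the inverse document frequency); the text does not say which variant or which threshold values are used, so `low` and `high` are hypothetical parameters.

```python
import math
from collections import Counter

def tfidf_clean(corpora, low, high):
    """Drop keywords whose TF-IDF is above `high` or below `low` within each corpus.

    corpora: dict mapping an app id to its list of keywords (first-stage corpus).
    """
    n = len(corpora)
    # Document frequency: number of app corpora that contain the keyword.
    df = Counter(kw for kws in corpora.values() for kw in set(kws))
    cleaned = {}
    for app, kws in corpora.items():
        tf = Counter(kws)

        def score(kw):
            return (tf[kw] / len(kws)) * math.log(n / df[kw])

        cleaned[app] = [kw for kw in kws if low <= score(kw) <= high]
    return cleaned

corpora = {
    "a": ["map", "map", "free", "navigation"],
    "b": ["chat", "free", "video"],
    "c": ["game", "free", "puzzle"],
}
cleaned = tfidf_clean(corpora, low=0.1, high=0.5)
```

In this toy input "free" appears in every corpus, so its TF-IDF is zero and it is removed as uninformative, while the over-dominant "map" in corpus "a" exceeds the upper bound.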
Further, in yet another embodiment, the application label mining unit 520 is further adapted to take the keywords remaining in the first-stage training corpus of each application after data cleaning as the second-stage training corpus of that application; for the second-stage training corpus of each application, when a keyword in that corpus appears in the title of the application, repeat the keyword in the second-stage training corpus a number of times equal to a fourth preset threshold, to obtain the training corpus of the application; the training corpora of all applications constitute the training corpus set.
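The title-based weighting can be illustrated with a short sketch. It assumes that "repeat" means each occurrence of a title keyword ends up appearing `repeat` times in total, and that matching is a case-insensitive substring test; neither detail is specified in the text, and the function name is hypothetical.

```python
def boost_title_keywords(keywords, title, repeat):
    """Repeat each keyword that occurs in the app title so it appears `repeat` times."""
    title_lc = title.lower()
    boosted = []
    for kw in keywords:
        # Assumption: substring match against the lower-cased title.
        boosted.extend([kw] * (repeat if kw.lower() in title_lc else 1))
    return boosted

training_corpus = boost_title_keywords(["map", "navigation", "free"], "Super Map", 3)
```

Because LDA works on word counts, repeating a keyword raises its weight in the application's document without changing the model itself.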
In one embodiment of the present invention, the application label mining unit 520 calculates the label system of each application from the application-topic probability distribution result and the topic-keyword probability distribution result as follows: calculate an application-keyword probability distribution result from the application-topic probability distribution result and the topic-keyword probability distribution result; then, according to the application-keyword probability distribution result, for each application, sort the keywords in descending order of their probability with respect to the application and select the top keywords, their number being a fifth preset threshold.
Here, the application label mining unit 520 is adapted to, for each application, obtain the probability of each topic with respect to the application from the application-topic probability distribution result; for each topic, obtain the probability of each keyword with respect to the topic from the topic-keyword probability distribution result; then, for each keyword, take the product of the keyword's probability with respect to a topic and that topic's probability with respect to an application as the topic-based probability of the keyword with respect to the application; and take the sum of the keyword's topic-based probabilities over all topics as the probability of the keyword with respect to the application.
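The computation described above amounts to summing, over all topics, the keyword's probability under a topic times that topic's probability for the application. A minimal sketch, using nested dictionaries as a stand-in for the LDA output (the data layout and names are assumptions):

```python
def app_keyword_probs(app_topic, topic_keyword):
    """P(keyword | app) = sum over topics of P(keyword | topic) * P(topic | app)."""
    result = {}
    for app, topics in app_topic.items():
        probs = {}
        for topic, p_topic in topics.items():
            for kw, p_kw in topic_keyword[topic].items():
                probs[kw] = probs.get(kw, 0.0) + p_kw * p_topic
        result[app] = probs
    return result

def top_keywords(probs, n):
    """Keywords sorted by descending probability, truncated to the first n."""
    return sorted(probs, key=probs.get, reverse=True)[:n]

app_topic = {"maps_app": {"t0": 0.8, "t1": 0.2}}
topic_keyword = {
    "t0": {"map": 0.6, "route": 0.4},
    "t1": {"map": 0.1, "chat": 0.9},
}
probs = app_keyword_probs(app_topic, topic_keyword)["maps_app"]
labels = top_keywords(probs, 2)
```

With these toy numbers "map" scores 0.6 x 0.8 + 0.1 x 0.2 = 0.5, ahead of "route" (0.32) and "chat" (0.18); `n` corresponds to the fifth preset threshold.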
Further, in one embodiment, the application label mining unit 520 is further adapted to take the top keywords selected for each application, their number being the fifth preset threshold, as the first-stage label system of that application; for the first-stage label system of each application, calculate the semantic relation value between each keyword in the first-stage label system and the summary of the application; for each keyword, take the product of the keyword's semantic relation value and the keyword's probability with respect to the application as the corrected probability of the keyword with respect to the application; and sort the keywords in the first-stage label system in descending order of corrected probability and select the top K keywords to constitute the label system of the application.
Specifically, the application label mining unit 520 is adapted to calculate the word vector of the keyword and the word vector of each term in the preset number of leading sentences of the application's summary; calculate the cosine similarity between the word vector of the keyword and the word vector of each term, and take the product of each cosine similarity and the weight of the sentence containing the corresponding term as the semantic relation value between the keyword and that term; and take the sum of the keyword's semantic relation values with all terms as the semantic relation value between the keyword and the summary of the application.
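The semantic relation value can be sketched with plain lists standing in for word vectors. The vector source, the sentence weights, and the choice to skip terms without a vector are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_relation(keyword_vec, sentences, sentence_weights, vectors):
    """Sum over summary terms of cosine(keyword, term) * weight of the term's sentence."""
    total = 0.0
    for terms, weight in zip(sentences, sentence_weights):
        for term in terms:
            if term in vectors:  # assumption: out-of-vocabulary terms are skipped
                total += cosine(keyword_vec, vectors[term]) * weight
    return total

vectors = {"map": [1.0, 0.0], "route": [1.0, 0.0], "chat": [0.0, 1.0]}
sentences = [["map", "route"], ["chat"]]  # segmented leading sentences of the summary
value = semantic_relation(vectors["map"], sentences, [1.0, 0.5], vectors)
```

Multiplying this value by the keyword's probability with respect to the application gives the corrected probability used for the top-K selection.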
Further, in another embodiment, the application label mining unit 520 is further adapted to take the keywords selected for each application as the second-stage label system of that application; for the second-stage label system of each application, obtain from the application search log the set of search words associated with download operations on the application, and count the DF value of each keyword of the second-stage label system within that set of search words; for each keyword, increase the keyword's probability with respect to the application by a multiple given by the DF value, to obtain the secondary corrected probability of the keyword with respect to the application; and sort the keywords in the second-stage label system in descending order of secondary corrected probability and select the top K keywords to constitute the label system of the application.
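One possible reading of this correction is sketched below. It assumes that the DF value is the fraction of download-related search words containing the keyword, and that "increasing by a multiple of DF" means p' = p * (1 + DF). Both are interpretations, not statements from the text.

```python
def second_correction(probs, download_queries):
    """Boost each keyword's probability by its DF among download-related queries.

    Assumed rule: p' = p * (1 + df), where df is the share of queries that
    contain the keyword as a substring.
    """
    n = len(download_queries)
    corrected = {}
    for kw, p in probs.items():
        df = sum(kw in q for q in download_queries) / n if n else 0.0
        corrected[kw] = p * (1.0 + df)
    return corrected

probs = {"map": 0.3, "route": 0.4}
queries = ["offline map", "map app", "bus route", "navigation"]
corrected = second_correction(probs, queries)
```

Keywords that users actually typed before downloading the application are promoted, which ties the label ranking to observed search behaviour.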
In one embodiment of the present invention, the application label mining unit 520 is adapted to obtain the quarterly download count of the application from the application search log, and to select the top K keywords to constitute the label system of the application according to that quarterly download count, where the value of K is a piecewise-linear (polyline) function of the application's quarterly download count.
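A piecewise-linear K can be sketched as interpolation between breakpoints. The breakpoint values below are hypothetical, since the text gives no concrete numbers; clamping outside the range and rounding to an integer are also assumptions.

```python
def label_count(downloads, breakpoints):
    """Piecewise-linear K: interpolate between (downloads, k) breakpoints and round.

    breakpoints: list of (downloads, k) pairs with strictly increasing downloads;
    values outside the range are clamped to the first or last k.
    """
    xs, ks = zip(*breakpoints)
    if downloads <= xs[0]:
        return ks[0]
    if downloads >= xs[-1]:
        return ks[-1]
    for (x0, k0), (x1, k1) in zip(breakpoints, breakpoints[1:]):
        if x0 <= downloads <= x1:
            return round(k0 + (k1 - k0) * (downloads - x0) / (x1 - x0))

# Hypothetical breakpoints: popular applications keep more labels.
BREAKPOINTS = [(0, 5), (1000, 10), (100000, 30)]
k = label_count(400, BREAKPOINTS)
```

Tying K to download volume lets widely used applications, which attract more varied queries, carry a larger label system than rarely downloaded ones.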
It should be noted that the working process of the application search device 400 in this embodiment corresponds to the implementation steps of the application search method described in Embodiment 1 and provides the same functions; the identical parts are not repeated here.
In summary, the technique provided by the present invention automatically obtains the summary of each application and obtains, in real time, the search words for each application from users' historical application search logs, thereby expanding the short text of the application and enabling application labels to be updated dynamically. At the same time, by effectively training an unsupervised LDA learning model together with a formulated preset strategy, it continuously improves the precision and recall of application labels, and thus mines and creates the label system of each application; it applies equally to newly released applications. This resolves the problems of traditional application label systems, which rely on manual annotation and therefore suffer from heavy manual workload, low coverage, and serious cheating, and it greatly improves the search quality of the application search engine and the user's search experience.
It should be noted that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual device, or other apparatus. Various general-purpose devices may also be used with the teachings herein. From the description above, the structure required to construct such devices is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the application search method and device according to embodiments of the present invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (10)

1. An application search method, comprising:
Building a label system for each application;
Receiving a search word uploaded by a client;
Matching the search word against the label system of each application;
When the search word matches a keyword in the label system of an application, returning the relevant information of the application to the client for display.
2. The method of claim 1, wherein building the label system of each application comprises:
Obtaining the summary of each application;
Obtaining the search words relating to each application from an application search log;
Mining the label system of each application according to the summary of each application, the search words, and a preset strategy.
3. The method of claim 1 or 2, wherein mining the label system of each application according to the summary of each application, the search words, and the preset strategy comprises:
Obtaining a training corpus set according to the summary and search words of each application;
Inputting the training corpus set into an LDA model for training, and obtaining the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model;
Calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
4. The method of any one of claims 1-3, wherein obtaining the training corpus set according to the summary and search words of each application comprises:
For each application, extracting the first paragraph, or a preset number of leading sentences, from the summary of the application; taking the extracted text together with the search words of the application as the original corpus of the application;
The original corpora of all applications constituting an original corpus set; preprocessing the original corpus set to obtain the training corpus set.
5. The method of any one of claims 1-4, wherein preprocessing the original corpus set comprises:
For each original corpus in the original corpus set, performing word segmentation on the original corpus to obtain a segmentation result containing multiple terms; finding the phrases formed by adjacent terms in the segmentation result; and retaining the phrases, together with the terms in the segmentation result that are nouns and the terms that are verbs, as the keywords retained for that original corpus.
6. An application search device, comprising:
A label system building unit, adapted to build a label system for each application;
An interaction unit, adapted to receive a search word uploaded by a client;
A search processing unit, adapted to match the search word against the label system of each application;
The interaction unit being further adapted to, when the search word matches a keyword in the label system of an application, return the relevant information of the application to the client for display.
7. The device of claim 6, wherein the label system building unit comprises:
An information acquisition unit, adapted to obtain the summary of each application, and to obtain the search words relating to each application from an application search log;
An application label mining unit, adapted to mine the label system of each application according to the summary of each application, the search words, and a preset strategy.
8. The device of claim 6 or 7, wherein
The application label mining unit is adapted to obtain a training corpus set according to the summary and search words of each application; input the training corpus set into an LDA model for training, and obtain the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model; and calculate the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
9. The device of any one of claims 6-8, wherein
The application label mining unit is adapted to, for each application, extract the first paragraph, or a preset number of leading sentences, from the summary of the application; take the extracted text together with the search words of the application as the original corpus of the application; the original corpora of all applications constitute an original corpus set; and preprocess the original corpus set to obtain the training corpus set.
10. The device of any one of claims 6-9, wherein
The application label mining unit is adapted to, for each original corpus in the original corpus set, perform word segmentation on the original corpus to obtain a segmentation result containing multiple terms; find the phrases formed by adjacent terms in the segmentation result; and retain the phrases, together with the terms in the segmentation result that are nouns and the terms that are verbs, as the keywords retained for that original corpus.
CN201611229802.5A 2016-12-27 2016-12-27 Application search method and device Expired - Fee Related CN106682170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611229802.5A CN106682170B (en) 2016-12-27 2016-12-27 Application search method and device

Publications (2)

Publication Number Publication Date
CN106682170A true CN106682170A (en) 2017-05-17
CN106682170B CN106682170B (en) 2020-09-18

Family

ID=58871714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611229802.5A Expired - Fee Related CN106682170B (en) 2016-12-27 2016-12-27 Application search method and device

Country Status (1)

Country Link
CN (1) CN106682170B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350977A (en) * 2007-07-20 2009-01-21 宁波萨基姆波导研发有限公司 Rapid searching method for mobile communication terminal
CN102760149A (en) * 2012-04-05 2012-10-31 中国人民解放军国防科学技术大学 Automatic annotating method for subjects of open source software
CN104133877A (en) * 2014-07-25 2014-11-05 百度在线网络技术(北京)有限公司 Software label generation method and device
CN104281656A (en) * 2014-09-18 2015-01-14 广州三星通信技术研究有限公司 Method and device for adding label information into application program
CN105893609A (en) * 2016-04-26 2016-08-24 南通大学 Mobile APP recommendation method based on weighted mixing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
OM P. DAMANI等: "Appropriately Incorporating Statistical Significance in PMI", 《PROCEEDINGS OF THE 2013 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 *
李湘东等: "一种基于加权LDA模型和多粒度的文本特征选择方法", 《现代图书情报技术》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291962B (en) * 2017-08-10 2020-06-26 Oppo广东移动通信有限公司 Searching method, searching device, storage medium and electronic equipment
CN107291962A (en) * 2017-08-10 2017-10-24 广东欧珀移动通信有限公司 searching method, device, storage medium and electronic equipment
CN107613520A (en) * 2017-08-29 2018-01-19 重庆邮电大学 A kind of telecommunication user similarity based on LDA topic models finds method
CN107613520B (en) * 2017-08-29 2020-08-04 重庆邮电大学 Telecommunication user similarity discovery method based on L DA topic model
CN108038192A (en) * 2017-12-11 2018-05-15 广东欧珀移动通信有限公司 Application searches method and apparatus, electronic equipment, computer-readable recording medium
CN108762804A (en) * 2018-04-24 2018-11-06 阿里巴巴集团控股有限公司 The method and apparatus that gray scale issues new product
CN108762804B (en) * 2018-04-24 2021-11-19 创新先进技术有限公司 Method and device for gray-scale releasing new product
CN111221928A (en) * 2018-11-27 2020-06-02 上海擎感智能科技有限公司 Thematic map display method and vehicle-mounted terminal
CN111221928B (en) * 2018-11-27 2024-02-23 上海擎感智能科技有限公司 Thematic map display method and vehicle-mounted terminal
CN109800348A (en) * 2018-12-12 2019-05-24 平安科技(深圳)有限公司 Search for information display method, device, storage medium and server
CN112052330A (en) * 2019-06-05 2020-12-08 上海游昆信息技术有限公司 Application keyword distribution method and device
CN112052330B (en) * 2019-06-05 2021-11-26 上海游昆信息技术有限公司 Application keyword distribution method and device
CN113609380A (en) * 2021-07-12 2021-11-05 北京达佳互联信息技术有限公司 Label system updating method, searching method, device and electronic equipment
CN113609380B (en) * 2021-07-12 2024-03-26 北京达佳互联信息技术有限公司 Label system updating method, searching device and electronic equipment
CN114168837A (en) * 2021-11-18 2022-03-11 深圳市梦网科技发展有限公司 Chatbot searching method, equipment and storage medium

Also Published As

Publication number Publication date
CN106682170B (en) 2020-09-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200918