CN106951548B

CN106951548B - Method and system for improving close-up word searching precision based on RM algorithm

Info

Publication number: CN106951548B
Application number: CN201710189291.7A
Authority: CN
Inventors: 陈刚; 曾明; 宋涛; 李京
Original assignee: Julong Fusion Technologies Co ltd
Current assignee: Julong Fusion Technologies Co ltd
Priority date: 2017-03-27
Filing date: 2017-03-27
Publication date: 2020-07-17
Anticipated expiration: 2037-03-27
Also published as: CN106951548A

Abstract

The disclosure provides a method and a system for improving close-up word searching precision based on an RM algorithm, and electronic equipment. The method for improving the close-up word searching precision based on the RM algorithm comprises the following steps: constructing a custom word segmentation library according to the web crawler data acquired by data acquisition and the internal data of the enterprise; performing word segmentation on the received query sentence by adopting various preset word segmentation algorithms to obtain a plurality of groups of word segmentation results; respectively searching by adopting the multi-group word segmentation results, and calculating the search score of each group; and correcting the user-defined word segmentation library according to the search scores of all the groups and the corresponding word segmentation results. The technical scheme of the invention can effectively improve the word segmentation precision and the searching accuracy of the specific service scene.

Description

Method and system for improving close-up word searching precision based on RM algorithm

Technical Field

The invention relates to the technical field of searching, in particular to a method and a system for improving close-up word searching precision based on an RM algorithm.

Background

Most existing search engines provide only standard search services. Their word segmentation results depend largely on training with search data gathered by web crawler tools, so they cannot segment words reasonably and accurately in a specific field. In addition, application integration is complex, and building indexes involves a heavy workload.

Large-scale internet applications and analysis and statistics systems need search service support. Search services give users a better information-gathering experience and have become an indispensable system component; certain fields further require personalized search services and higher search accuracy.

Therefore, a new method and system for improving the search precision of close-up words based on the RM algorithm are needed.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.

Disclosure of Invention

The invention aims to provide a method and a system for improving close-up word searching precision based on an RM algorithm, so as to overcome one or more problems caused by the limitations and defects of the related art at least to a certain extent.

Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.

According to one aspect of the disclosure, a method for improving close-up word searching precision based on an RM algorithm is provided, and comprises the following steps:

constructing a custom word segmentation library according to the web crawler data acquired by data acquisition and the internal data of the enterprise;

performing word segmentation on the received query sentence by adopting various preset word segmentation algorithms to obtain a plurality of groups of word segmentation results;

respectively searching by adopting the multi-group word segmentation results, and calculating the search score of each group;

and correcting the user-defined word segmentation library according to the search scores of all the groups and the corresponding word segmentation results.

In an exemplary embodiment of the present disclosure, the method further comprises: a search platform based on a search engine is constructed, and search services and a framework are provided.

In an exemplary embodiment of the present disclosure, the intra-enterprise data includes intra-enterprise log data and business data.

In an exemplary embodiment of the present disclosure, constructing the custom segmentation library according to the web crawler data and the internal data of the enterprise acquired by data acquisition includes:

acquiring an open source general word bank according to the web crawler data;

acquiring specific vocabularies of corresponding business scenes according to the internal data of the enterprise;

and merging the open-source general word bank and the specific vocabulary to obtain the user-defined word bank.

In an exemplary embodiment of the present disclosure, the plurality of preset word segmentation algorithms include: two or three of a maximum forward matching method, a reverse maximum matching method and a shortest path method.

In an exemplary embodiment of the present disclosure, the multiple groups of segmented word results are used to perform searches respectively, and a calculation formula for calculating the search score of each group is as follows:

RM(A)＝RM(a1a2…am)＝RM(a1)*P(a1)+RM(a2)*P(a2)+…RM(am)*P(am)

wherein A is the query statement, RM(A) is the search score, a1a2…am is a group of word segmentation results, RM(ai) is the score obtained from the search engine, P(ai) represents the weight of the corresponding phrase, and 1 ≤ i ≤ m.

In an exemplary embodiment of the present disclosure, the performing searches respectively using the multi-group segmented word results, and calculating the search score of each group includes:

initially setting the weight P (ai) of each phrase to be 1/m;

after the search score is obtained, receiving feedback information;

and adjusting the weight of the corresponding phrase according to the feedback information, and recalculating the search score.

In an exemplary embodiment of the present disclosure, correcting the custom word segmentation library according to the search scores of each group and the corresponding word segmentation results includes:

sorting in descending order according to the search scores of each group;

and comparing the word group with the highest score of the word segmentation result with the entries in the user-defined word segmentation library, and correcting the corresponding entries in the user-defined word segmentation library.

According to one aspect of the present disclosure, there is provided a system for improving close-up word search accuracy based on an RM algorithm, comprising:

the user-defined word segmentation library construction module is used for constructing a user-defined word segmentation library according to the web crawler data acquired by data acquisition and the internal data of the enterprise;

the word segmentation module is used for segmenting words of the received query sentence by adopting various preset word segmentation algorithms to obtain a plurality of groups of word segmentation results;

the search score calculation module is used for respectively searching by adopting the multi-group word segmentation results and calculating the search score of each group;

and the correction module is used for correcting the user-defined word segmentation library according to the search scores of all the groups and the corresponding word segmentation results.

According to an aspect of the present disclosure, there is provided an electronic device including:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the method of improving close-up word search accuracy based on the RM algorithm of any one of the above.

In the technical solutions provided in some embodiments of the present invention, on one hand, a custom segmentation library is constructed by collecting internal data of an enterprise and integrating general web crawler data, and phrases or words in the custom segmentation library can be closer to a specific vocabulary of a specific service scene, which can be beneficial to improving the accuracy of a segmentation result. On the other hand, various word segmentation algorithms are integrated, and the word segmentation results of each group are respectively searched to obtain the search scores of each group, so that the search accuracy can be further improved, and the search results more suitable for the user are provided. Meanwhile, the user-defined word segmentation library is corrected through the search scores of all the groups and the corresponding word segmentation results, and the precision of the word segmentation results can be further improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:

FIG. 1 schematically illustrates a flow diagram of a method of improving close-up word search accuracy based on RM algorithms, in accordance with an embodiment of the present invention;

FIG. 2 is a system diagram schematically illustrating a method for improving close-up word search accuracy based on RM algorithm according to an embodiment of the present invention;

FIG. 3 schematically illustrates a block diagram of a system for enhancing the precision of close-up word searches based on RM algorithms, in accordance with an embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

In view of the above existing needs and deficiencies of the technology, the embodiments of the present disclosure provide a method and a system for improving close-up word search precision based on an RM algorithm, so as to solve the problems of low segmentation accuracy and low search precision in specific industries and fields of the existing search technology.

FIG. 1 schematically shows a flow diagram of a method for improving close-up word search accuracy based on RM algorithms, in accordance with an embodiment of the present invention.

As shown in fig. 1, in step S100, a custom thesaurus is constructed according to the web crawler data and the internal data of the enterprise obtained by data acquisition.

In an exemplary embodiment, the method may further include: a search platform based on a search engine is constructed, and search services and a framework are provided.

In an exemplary embodiment, the intra-enterprise data may include intra-enterprise log data and business data.

In an exemplary embodiment, constructing the custom thesaurus according to the web crawler data and the internal data of the enterprise obtained by data acquisition may further include: acquiring an open source general word bank according to the web crawler data; acquiring specific vocabularies of corresponding business scenes according to the internal data of the enterprise; and merging the open-source general word bank and the specific vocabulary to obtain the user-defined word bank.

In the embodiment of the invention, data acquisition can be carried out by building an efficient crawler extraction plug-in. With an efficient deployment mode, the plug-in acquires network and log data periodically and in batches. However, the data extracted directly by the plug-in is unstructured and cannot be used as-is; it needs further processing. Hadoop MapReduce can therefore be introduced into the processing stage, so that processing is efficient and updates can even be made in a timely manner.
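As a minimal sketch of the kind of map and reduce steps such processing might use, the following pure-Python example counts candidate terms in raw log lines. It only imitates the MapReduce pattern in a single process and does not call any Hadoop API; the log lines, the whitespace tokenization and the function names `map_line` and `reduce_counts` are illustrative assumptions, not part of the disclosure.

```python
from collections import defaultdict
from itertools import chain


def map_line(line):
    """Map step: emit (term, 1) for every whitespace-separated term in a log line."""
    return [(term, 1) for term in line.lower().split()]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per term."""
    counts = defaultdict(int)
    for term, n in pairs:
        counts[term] += n
    return dict(counts)


# Hypothetical raw log lines collected by the crawler plug-in.
log_lines = [
    "search convenience store",
    "search supermarket",
    "open convenience store page",
]

term_counts = reduce_counts(chain.from_iterable(map_line(line) for line in log_lines))
```

The resulting term frequencies are the sort of structured intermediate data that a later step could store in the data warehouse.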

In the embodiment of the invention, a data warehouse can be used to analyze, filter, summarize and store the collected data. The data warehouse stores and analyzes the data collected in the previous step and prepares it for the next processing stage.

In the embodiment of the invention, data is gathered by a web crawler, and data from specific sources (such as internal data of an enterprise) is collected as the data source to be trained on. The training can be carried out on an intelligent training platform. Training on this data reflects the specific application scenario, since a general-purpose training result cannot meet the precision requirement of certain search services. After training, a custom thesaurus containing the custom entries is obtained, and the custom thesaurus can be used to correct search results.

For example, in a certain enterprise's internal system, the query sentence "three convenience stores" should not be segmented at all. If a common word segmentation algorithm divides it into two phrases (for example, splitting off "convenience stores") and searches on them, the segmentation result does not express what the user really means, so the search result finally returned is not necessarily what the customer wants. Such a business scenario can only be handled by custom entries. As another example, a combination with a specific meaning, such as "wangsaidi supermarket", is the name of a shop in a specific business and cannot be segmented conventionally during a search. Conventional word segmentation would split it into pieces such as "supermarket", "wang xiao bi", "king" and "xiao bi", whereas in the specific service it must not be split. In the embodiment of the invention, whether segmentation is needed is judged according to the relationship between the two words "Wangxiedi" and "supermarket". The relationship between words is calculated from information extracted from data sources such as logs of the specific service and from initialization settings.
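The disclosure does not give the formula for the relationship between two words. One possible measure, shown here purely as an assumption, is pointwise mutual information (PMI) computed from query-log counts: a high value suggests the two words should stay together as one custom entry. The counts, the threshold and the function names are hypothetical.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Pointwise mutual information log2(p(x, y) / (p(x) * p(y))) from raw counts."""
    # p(x, y) / (p(x) * p(y)) simplifies to pair_count * total / (count_x * count_y).
    return math.log2(pair_count * total / (count_x * count_y))


def keep_as_one_entry(pair_count, count_x, count_y, total, threshold=1.0):
    """Treat the two words as a single custom entry when their PMI exceeds the threshold."""
    if pair_count == 0:
        return False  # never seen together: nothing suggests a fixed combination
    return pmi(pair_count, count_x, count_y, total) > threshold


# Hypothetical counts from 1000 logged queries: the shop name appears 10 times,
# always directly followed by "supermarket", which itself appears 100 times.
strong = keep_as_one_entry(pair_count=10, count_x=10, count_y=100, total=1000)

# Two words that co-occur only as often as chance would predict.
weak = keep_as_one_entry(pair_count=1, count_x=10, count_y=100, total=1000)
```

Here `strong` is true because the pair occurs ten times more often than independence would predict, while `weak` is false because its PMI is zero.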

In the embodiment of the invention, the training method can judge search accuracy from users' search records and access history, and compare word segmentation results with the custom word segmentation library, thereby gradually correcting the custom word segmentation library.

In the embodiment of the present invention, the specific process of constructing the custom thesaurus may include: defining word segmentation rules according to a word segmentation acquisition algorithm; and constructing a word segmentation library, merging word segmentation rules and storing the word segmentation rules.

For example, the thesaurus may include commonly used open-source word banks as well as the word bank of a specific service. When these are merged, entries may overlap, and word segmentation rules are needed to reconcile them. The word segmentation rules are therefore consolidated when the word segmentation library is constructed.
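A minimal sketch of such a merge, assuming (this is not stated in the disclosure) that the reconciling rule is simply that the business-specific entry wins when both sources define the same term. The entry fields, the sample terms and the function name `merge_lexicons` are hypothetical.

```python
def merge_lexicons(general, specific):
    """Merge a general word bank with business-specific vocabulary.

    Both arguments map a term to its frequency. On overlap, the specific
    entry overrides the general one (an assumed reconciliation rule).
    """
    merged = {
        term: {"freq": freq, "source": "general"} for term, freq in general.items()
    }
    for term, freq in specific.items():
        merged[term] = {"freq": freq, "source": "specific"}
    return merged


# Hypothetical inputs: an open-source general word bank and business vocabulary.
general_bank = {"supermarket": 120, "store": 300}
specific_vocab = {"supermarket": 45, "wangsaidi supermarket": 12}

custom_lexicon = merge_lexicons(general_bank, specific_vocab)
```

Recording the source of each entry keeps later corrections traceable to either the general or the business-specific data.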

In step S102, a plurality of preset word segmentation algorithms are used to segment the received query sentence, and a multi-group word segmentation result is obtained.

In an exemplary embodiment, the plurality of preset word segmentation algorithms may include: two or three of a maximum forward matching method, a reverse maximum matching method and a shortest path method.

For example, assume that the query sentence input by the user is "do not know what you are saying". The forward maximum matching method segments this sentence from left to right and yields the first group of word segmentation results.

After word segmentation, P(A1A2…Aq)＝[a1, a2, …, aj]

wherein A is the input query statement, A1 to Aq are the q characters included in the query statement, and a1 to aj represent the j phrases or characters obtained after word segmentation.

Similarly, taking the above query sentence "do not know what you are saying" as an example, the reverse maximum matching method, which segments from right to left, yields the second group of word segmentation results; this group can contain more phrases or single characters than the first group's result.
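A minimal sketch of forward and reverse maximum matching. It uses a toy English string without spaces and a four-word lexicon, both invented for illustration, to show that the two scanning directions can produce different groups; a single character that matches no lexicon entry is emitted as its own token.

```python
def forward_max_match(text, lexicon, max_len=5):
    """Scan left to right, always taking the longest lexicon entry that fits."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, lexicon, max_len=5):
    """Scan right to left, always taking the longest lexicon entry that fits."""
    tokens, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


toy_lexicon = {"no", "now", "here", "where"}
forward_group = forward_max_match("nowhere", toy_lexicon)
reverse_group = reverse_max_match("nowhere", toy_lexicon)
```

With this lexicon the forward scan yields `now` + `here` and the reverse scan yields `no` + `where`, which is why each group is searched and scored separately.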

With shortest-path word segmentation, "do not know what you are saying" is divided into the fewest possible words; the third group of word segmentation results obtained this way contains only 3 words.
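A minimal sketch of shortest-path segmentation, read here as "the segmentation with the fewest tokens" and implemented with dynamic programming over character positions. The lexicon and the `max_len` parameter are illustrative assumptions; when two segmentations tie, this sketch keeps the first one it finds.

```python
def shortest_path_segment(text, lexicon, max_len=8):
    """Return a segmentation of text that uses the fewest tokens."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] holds a fewest-token split of text[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Single characters are always allowed so every position is reachable.
            if end - start == 1 or piece in lexicon:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


two_tokens = shortest_path_segment("nowhere", {"no", "now", "here", "where"})
one_token = shortest_path_segment("nowhere", {"no", "now", "here", "where", "nowhere"})
```

Adding the longer entry `nowhere` to the lexicon reduces the result from two tokens to one, which mirrors how a custom entry keeps a business-specific phrase from being split.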

In an exemplary embodiment, all three word segmentation methods can be adopted. The entries are segmented according to the common dictionary (common dic) built from crawler data, and the word segmentation results are compared with the custom dictionary (custom dic) built by the data acquisition platform, so that the word segmentation results are corrected and, through the scoring steps below, the custom word segmentation library is obtained.

In step S104, the multi-group word segmentation results are used to perform a search respectively, and a search score of each group is calculated.

In an exemplary embodiment, the formula for calculating the search score of each group when searching with the multiple groups of word segmentation results may be as follows:

RM(A)＝RM(a1a2…am)＝RM(a1)*P(a1)+RM(a2)*P(a2)+…RM(am)*P(am)

RM(ai) in the above formula can be obtained from a search engine such as ElasticSearch, that is, it is the score returned by a search using conventional word segmentation; the open-source search engine provides the general search logic.
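As a hedged sketch of how RM(ai) might be read from Elasticsearch: the request body below uses the standard `match_phrase` query, and the score is taken from the `hits.max_score` field of a search response. Taking the top hit's score as RM(ai) is an assumption made for illustration, as are the field name `title` and the hand-written sample responses; no live cluster is contacted.

```python
def build_phrase_query(field, phrase):
    """Request body for an Elasticsearch _search call matching one phrase."""
    return {"query": {"match_phrase": {field: phrase}}}


def phrase_score(response):
    """Return hits.max_score from a search response, or 0.0 when nothing matched."""
    max_score = response.get("hits", {}).get("max_score")
    return float(max_score) if max_score is not None else 0.0


body = build_phrase_query("title", "convenience store")

# Hand-written stand-ins shaped like Elasticsearch search responses.
sample_response = {"hits": {"max_score": 1.5, "hits": [{"_id": "1", "_score": 1.5}]}}
empty_response = {"hits": {"max_score": None, "hits": []}}

score_found = phrase_score(sample_response)
score_missing = phrase_score(empty_response)
```

Treating a phrase with no hits as scoring 0.0 means it contributes nothing to the weighted sum for its group.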

How the query is split depends on the three word segmentation methods, while the search score depends on the calculation of P(ai). The value of P(ai) relates to factors such as the frequency with which ai occurs in the searched documents, its rarity, its synonyms and its hypernyms. For example, a low-frequency word has a high matching degree, so its weight P(ai) is larger; a high-frequency word has a low matching degree, so its weight P(ai) is smaller. The present disclosure is not limited thereto.
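One way to realize "rarer phrases get larger weights" is an inverse-document-frequency style weight, normalized so that the weights of a group sum to 1. This is an assumption for illustration; the disclosure does not fix a formula for P(ai), and the document counts below are made up.

```python
import math


def rarity_weights(doc_freqs, total_docs):
    """Weights proportional to log((N + 1) / (df + 1)), normalized to sum to 1."""
    if not doc_freqs:
        return []
    raw = [math.log((total_docs + 1) / (df + 1)) for df in doc_freqs]
    total = sum(raw)
    if total == 0:
        # Every phrase occurs in every document: fall back to uniform weights.
        return [1.0 / len(doc_freqs)] * len(doc_freqs)
    return [r / total for r in raw]


# Hypothetical document frequencies for three phrases in a 1000-document index.
weights = rarity_weights([5, 400, 50], total_docs=1000)
```

The phrase found in only 5 documents receives the largest weight and the one found in 400 documents the smallest, matching the low-frequency and high-frequency cases described above.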

The forward maximum matching method, the reverse maximum matching method and the shortest-path word segmentation method are again taken as examples.

The search score of the first group of word segmentation results is calculated by the formula:

RM(do not know what you are saying)＝RM(do not know)*P(do not know)+RM(you)*P(you)+RM(are)*P(are)+RM(saying what)*P(saying what)

The search score of the second group of word segmentation results is calculated by the formula:

RM(do not know what you are saying)＝RM(not, know, you are, saying what)＝RM(not)*P(not)+RM(know)*P(know)+RM(you are)*P(you are)+RM(saying what)*P(saying what)

The search score of the third group of word segmentation results is calculated by the formula:

RM(do not know what you are saying)＝RM(do not know)*P(do not know)+RM(you are)*P(you are)+RM(saying what)*P(saying what)
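The three group scores can be sketched as follows, with every weight at its initial value 1/m. The per-phrase engine scores are invented numbers standing in for RM(ai), and the English phrases are glosses of the example query's segments; neither comes from the disclosure.

```python
def rm_score(phrases, engine_score):
    """RM(A) = sum of RM(ai) * P(ai), using the initial weight P(ai) = 1/m."""
    m = len(phrases)
    return sum(engine_score[p] * (1.0 / m) for p in phrases)


# One group of phrases per word segmentation algorithm (English glosses).
groups = {
    "forward_max": ["do not know", "you", "are", "saying what"],
    "reverse_max": ["not", "know", "you are", "saying what"],
    "shortest_path": ["do not know", "you are", "saying what"],
}

# Hypothetical scores returned by the search engine for each phrase.
engine_score = {
    "do not know": 3.0, "you": 1.0, "are": 1.0, "saying what": 3.0,
    "not": 1.0, "know": 1.0, "you are": 2.0,
}

scores = {name: rm_score(phrases, engine_score) for name, phrases in groups.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With uniform weights the score is simply the mean of the per-phrase engine scores, so groups made of fewer, more specific phrases tend to rank higher in this toy data.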

In an exemplary embodiment, performing searches using the multiple groups of word segmentation results and calculating the search score of each group may further include: initially setting the weight P(ai) of each phrase to 1/m; after the search score is obtained, receiving feedback information; and adjusting the weight of the corresponding phrase according to the feedback information, and recalculating the search score.

In the embodiment of the present invention, for a group of four phrases the initial weight of each phrase is set to 1/4, and this proportion is changed when the weights are adjusted.

With the initial weight of 1/4, a word segmentation method, such as the forward maximum matching method, is selected according to the search scores, and the entries of its word segmentation result form the custom word segmentation library; the weights are then adjusted according to feedback from the business. For example, when the accuracy reported back from users' product searches is off, manual intervention may be applied to adjust the weight of a certain entry according to, for example, how frequently the entry occurs in the business. After the search score is obtained, the weight of the entry is adjusted according to the service scenario, where 0 < P(ai) < 1.

After the weights are adjusted, the RM search scores are recalculated with the adjusted weights and the groups are re-sorted; the group of word segmentation results with the highest score after re-sorting is selected as the word segmentation result of the current query sentence.
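A minimal sketch of this adjust, recalculate and re-sort loop. The group names, engine scores and adjusted weights are hypothetical, and the weights are deliberately not renormalized after adjustment because the text only requires 0 < P(ai) < 1.

```python
def rm_score(engine_scores, weights):
    """RM(A) = sum of RM(ai) * P(ai) for one group of phrases."""
    return sum(s * w for s, w in zip(engine_scores, weights))


def rank(groups, weights):
    """Group names sorted by descending RM score."""
    return sorted(groups, key=lambda g: rm_score(groups[g], weights[g]), reverse=True)


def adjust(weights, group, index, new_weight):
    """Apply feedback by overwriting one phrase weight, keeping 0 < P(ai) < 1."""
    if not 0.0 < new_weight < 1.0:
        raise ValueError("weight must lie strictly between 0 and 1")
    weights[group][index] = new_weight


# Hypothetical engine scores RM(ai) for the phrases of two groups.
groups = {"A": [4.0, 2.0], "B": [3.0, 3.0, 1.0, 1.0]}

# Initial weights: 1/m for each of the m phrases in a group.
weights = {g: [1.0 / len(s)] * len(s) for g, s in groups.items()}
before = rank(groups, weights)  # A scores 3.0, B scores 2.0

# Feedback says the first two phrases of group B matter more in this business.
adjust(weights, "B", 0, 0.75)
adjust(weights, "B", 1, 0.5)
after = rank(groups, weights)  # B now scores 4.25 and moves to the top
```

Whether weights should be renormalized after an adjustment is a design choice the disclosure leaves open; renormalizing would keep scores comparable across groups of different sizes.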

In step S106, the customized segmentation library is modified according to the search scores of each group and the corresponding segmentation results.

In an exemplary embodiment, correcting the custom word segmentation library according to the search scores of each group and the corresponding word segmentation results includes: sorting in descending order according to the search scores of each group; and comparing the phrases of the highest-scoring word segmentation result with the entries in the custom word segmentation library, and correcting the corresponding entries in the custom word segmentation library.

In the embodiment of the invention, the most appropriate word segmentation method of the current query sentence is determined according to the search result, and the word segmentation result of the word segmentation method is added into the user-defined word segmentation library.
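A minimal sketch of this correction step: sort the groups by score in descending order, take the top group, and add any of its phrases that the custom library lacks. Modelling "correcting" as insertion only is a simplifying assumption, and the scores and phrases are illustrative.

```python
def correct_lexicon(custom_lexicon, scored_groups):
    """Add the phrases of the highest-scoring group that are missing from the lexicon.

    scored_groups is a list of (score, phrases) pairs. Returns the added phrases.
    """
    ranked = sorted(scored_groups, key=lambda item: item[0], reverse=True)
    _, best_phrases = ranked[0]
    missing = [p for p in best_phrases if p not in custom_lexicon]
    custom_lexicon.update(missing)
    return missing


# Hypothetical custom library and scored segmentation groups.
custom_lexicon = {"do not know"}
scored_groups = [
    (2.0, ["do not know", "you", "are", "saying what"]),
    (2.67, ["do not know", "you are", "saying what"]),
]

added = correct_lexicon(custom_lexicon, scored_groups)
```

After the call the library contains the three phrases of the winning group, so later queries can match them directly.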

Fig. 2 is a system structural diagram schematically illustrating a method for improving close-up word search accuracy based on an RM algorithm according to an embodiment of the present invention.

As shown in FIG. 2, a search platform based on a search engine such as ElasticSearch is constructed to provide basic search services and a framework. In addition, a data acquisition platform is constructed; it acquires data mainly through web crawling, enterprise internal log data and specific data patterns, and summarizes, classifies and filters the data effectively. The RM algorithm is then used for data analysis and for training word segmentation.

A specific data pattern here refers to a source such as database logs, which is a data source that needs to be treated separately and processed specially. Some of the data is business data and is therefore highly effective; some log data is less effective and needs to be weighted accordingly, so as to provide a more accurate basis for word segmentation.

Building on existing search algorithms, the method for improving close-up word search precision based on the RM algorithm adds purposeful screening and filtering for the specific vocabulary of a specific service scenario, and calculates the degree of association between a word and the specific vocabulary to judge whether conventional word segmentation is needed, thereby improving word segmentation precision.

As shown in fig. 3, the system 10 for improving close-up word search precision based on RM algorithm may include a custom segmentation word library building module 100, a segmentation word module 102, a search score calculation module 104, and a modification module 106.

The custom thesaurus building module 100 can be used for building a custom thesaurus according to the web crawler data and the internal data of the enterprise, which are acquired by data acquisition.

The word segmentation module 102 may be configured to perform word segmentation on the received query sentence by using a plurality of preset word segmentation algorithms to obtain a plurality of sets of word segmentation results.

The search score calculation module 104 may be configured to perform searches using the multiple groups of word segmentation results, and calculate a search score for each group.

The modification module 106 may be configured to modify the custom thesaurus according to the search scores of each group and the corresponding word segmentation results.

In an exemplary embodiment, the system 10 may further include a building module, which can be used for building a search platform based on a search engine and providing search services and a framework.

In an exemplary embodiment, the custom thesaurus building module 100 may further include: the universal word bank obtaining unit is used for obtaining an open-source universal word bank according to the web crawler data; the specific vocabulary acquisition unit is used for acquiring specific vocabularies of corresponding service scenes according to the internal data of the enterprise; and the merging unit is used for merging the open-source general word bank and the specific vocabulary to obtain the user-defined word bank.

In an exemplary embodiment, the plurality of preset word segmentation algorithms includes: two or three of a maximum forward matching method, a reverse maximum matching method and a shortest path method.

In an exemplary embodiment, the search score calculation module 104 performs searches using the multiple groups of word segmentation results, and calculates the search score of each group according to the following formula:

RM(A)＝RM(a1a2…am)＝RM(a1)*P(a1)+RM(a2)*P(a2)+…RM(am)*P(am)

In an exemplary embodiment, the search score calculation module 104 may further include: an initialization unit for initially setting the weight P(ai) of each phrase to 1/m; a feedback information receiving unit for receiving feedback information after the search score is obtained; and a score recalculation unit for adjusting the weight of the corresponding phrase according to the feedback information and recalculating the search score.

In an exemplary embodiment, the modification module 106 may further include: a sorting unit for performing descending sorting according to the search scores of the groups; and the correcting unit is used for comparing the word group with the highest score of the word segmentation result with the entry in the user-defined word segmentation library and correcting the corresponding entry in the user-defined word segmentation library.
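As a rough architectural sketch, the modules described above can be wired together as plain callables held by one object. The class name, the stand-in segmenters and the length-based scoring function are all hypothetical and chosen only so that the example runs without a search engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class SearchPrecisionSystem:
    lexicon: Set[str]  # output of the library building module
    segmenters: Dict[str, Callable[[str, Set[str]], List[str]]]  # word segmentation module
    score_phrase: Callable[[str], float]  # stand-in for the engine score RM(ai)

    def handle(self, query: str) -> Tuple[str, List[str]]:
        # Word segmentation module: one group of phrases per algorithm.
        groups = {name: seg(query, self.lexicon) for name, seg in self.segmenters.items()}
        # Search score calculation module: uniform initial weights 1/m.
        scores = {
            name: sum(self.score_phrase(p) / len(phrases) for p in phrases)
            for name, phrases in groups.items()
        }
        # Modification module: sort descending and fold the best group into the lexicon.
        best = sorted(scores, key=scores.get, reverse=True)[0]
        self.lexicon.update(groups[best])
        return best, groups[best]


system = SearchPrecisionSystem(
    lexicon=set(),
    segmenters={
        "by_space": lambda q, lex: q.split(),  # toy segmenter: split on spaces
        "whole": lambda q, lex: [q],           # toy segmenter: keep the query intact
    },
    score_phrase=lambda phrase: float(len(phrase)),  # toy score: phrase length
)

best_name, best_phrases = system.handle("corner shop")
```

Keeping each module behind a callable matches the statement that the division into modules is not mandatory: any of them can be swapped or merged without changing the others.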

For specific implementation of each component module and/or unit in the system for improving close-up word search accuracy based on the RM algorithm in the embodiment of the present invention, reference may be made to the above method embodiment, and details are not described here.

Further, the embodiment of the present disclosure also provides an electronic device, which may include a processor and a memory. The memory may be used to store executable instructions for the processor. Wherein the processor is configured to execute the method for improving the close-up word searching precision based on the RM algorithm in any one of the above embodiments.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A method for improving close-up word searching precision based on RM algorithm is characterized by comprising the following steps:

respectively searching by adopting the multi-group word segmentation results, and calculating the search score of each group; the calculation formula for respectively searching by adopting the multi-group word segmentation results and calculating the search scores of each group is as follows: RM(A)＝RM(a1a2…am)＝RM(a1)*P(a1)+RM(a2)*P(a2)+…RM(am)*P(am); wherein A is the query statement, RM(A) is the search score, a1a2…am is a group of word segmentation results, RM(ai) is the score obtained according to the search engine, P(ai) represents the weight of the corresponding phrase, and 1 ≤ i ≤ m;

initially setting the weight P (ai) of each phrase to be 1/m; after the search score is obtained, receiving feedback information; adjusting the weight of the corresponding phrase according to the feedback information and then recalculating the search score;

2. The method of claim 1, further comprising: a search platform based on a search engine is constructed, and search services and a framework are provided.

3. The method of claim 1, wherein the intra-enterprise data comprises intra-enterprise log data and business data.

4. The method of claim 1, wherein constructing the custom segmentation library from the web crawler data and the enterprise internal data obtained by data acquisition comprises:

acquiring an open source general word bank according to the web crawler data;

5. The method of claim 1, wherein the plurality of preset word segmentation algorithms comprises: two or three of a maximum forward matching method, a reverse maximum matching method and a shortest path method.

6. The method of claim 1, wherein the custom thesaurus is modified according to the search scores of each group and the corresponding word segmentation results:

sorting in descending order according to the search scores of each group;

7. A system for improving close-up word search accuracy based on RM algorithm is characterized by comprising:

the search score calculation module is used for respectively searching by adopting the multi-group word segmentation results and calculating the search score of each group; the calculation formula for respectively searching by adopting the multi-group word segmentation results and calculating the search scores of each group is as follows: RM(A)＝RM(a1a2…am)＝RM(a1)*P(a1)+RM(a2)*P(a2)+…RM(am)*P(am); wherein A is the query statement, RM(A) is the search score, a1a2…am is a group of word segmentation results, RM(ai) is the score obtained according to the search engine, P(ai) represents the weight of the corresponding phrase, and 1 ≤ i ≤ m;

8. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the method of improving close-up word search accuracy based on the RM algorithm of any of the preceding claims 1-6.