CN110633352A - Semantic retrieval method and device - Google Patents

Info

Publication number
CN110633352A
Authority
CN
China
Prior art keywords
word
segmentation
words
word segmentation
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810554080.3A
Other languages
Chinese (zh)
Inventor
胡娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201810554080.3A priority Critical patent/CN110633352A/en
Priority to PCT/CN2019/081444 priority patent/WO2019228065A1/en
Publication of CN110633352A publication Critical patent/CN110633352A/en
Priority to US17/093,664 priority patent/US20210089531A1/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic retrieval method and a semantic retrieval device. The method comprises the following steps: acquiring a word segmentation word list and a text input by a user; performing word segmentation on the text according to the word segmentation word list to determine a first word segmentation result; and searching the text according to the first word segmentation result. The method improves the semantic retrieval capability of the system and enables the text to be searched effectively.

Description

Semantic retrieval method and device
[ technical field ]
The present disclosure relates to semantic retrieval methods and devices, and in particular to a semantic retrieval method and device for the travel field.
[ background of the invention ]
During address searching, the search results may be inaccurate, which directly affects the user's search experience. At present, the common practice is to use a general-purpose word list to segment the text input by a user and to search for the content the user is interested in according to the word segmentation result. The drawback of this practice is that a general-purpose word list is not targeted at any particular field, so the accuracy of the search results is low and the user experience is poor.
[ summary of the invention ]
Aiming at the problem of inaccurate search results, the invention aims to provide a more accurate and effective semantic retrieval method.
In order to achieve the purpose of the invention, the technical scheme provided by the invention is as follows:
a semantic retrieval method comprises the steps of obtaining a word segmentation word list; acquiring a text input by a user; performing word segmentation on the text according to a word segmentation word list, and determining a first word segmentation result; and searching the text according to the first word segmentation result.
In the present invention, the first segmentation result includes a fine-grained segmentation result and a coarse-grained segmentation result.
In the present invention, the first word segmentation result further includes a combination of a plurality of words having a probability of simultaneous occurrence greater than a set threshold.
In the invention, a method for generating the word segmentation word list comprises the steps of obtaining a word list model; acquiring a training corpus; performing word segmentation on the training corpus according to the word list model to obtain a second word segmentation result; and determining the word segmentation word list according to the second word segmentation result.
In the present invention, said determining the word segmentation word list according to the second word segmentation result further comprises: determining the word segmentation word list through multiple rounds of iteration.
In the invention, each round of the multiple rounds of iteration further comprises performing preliminary word segmentation according to the word list model and the training corpus to determine preliminary words; acquiring a preset rule; judging whether the preliminary words meet the preset rule; and when a preliminary word meets the preset rule, adding the preliminary word to the word list model to generate a new word list model, and performing the next round of word segmentation.
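As a non-limiting illustration, the multi-round iteration described above might be sketched as follows. The forward maximum matching segmenter, the candidate generation by joining adjacent tokens, and the frequency-based rule (`min_count`) are assumptions made for this sketch only; the text above does not specify the preset rule or the segmentation algorithm.

```python
from collections import Counter


def segment(text, vocab):
    """Forward maximum matching: take the longest vocabulary entry at each position."""
    max_len = max((len(w) for w in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the vocabulary matches.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_vocabulary(corpus, vocab_model, min_count=2, max_rounds=5):
    """Grow the word list over several rounds (hypothetical rule: count >= min_count)."""
    vocab = set(vocab_model)
    for _ in range(max_rounds):
        candidates = Counter()
        for text in corpus:
            tokens = segment(text, vocab)
            # Preliminary words: concatenations of adjacent tokens.
            candidates.update(a + b for a, b in zip(tokens, tokens[1:]))
        new_words = {w for w, c in candidates.items() if c >= min_count and w not in vocab}
        if not new_words:
            break  # nothing met the rule, so the word list has stabilised
        vocab |= new_words  # the new word list model is used in the next round
    return vocab


# Unspaced lowercase strings stand in for text without word delimiters.
corpus = ["westlakepark", "westlakehotel"]
vocabulary = build_vocabulary(corpus, {"west", "lake"})
print(sorted(vocabulary))  # ['lake', 'west', 'westlake']
```

Here "westlake" is added in the first round because it occurs twice, and the second round finds no further candidate that meets the rule, so the iteration stops.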
In the invention, the generation method of the word segmentation word list further comprises the steps of obtaining a user log, wherein the user log comprises search words input by a user or search results selected by the user; and determining a new word according to the user log.
In the invention, the determining of the word segmentation word list further comprises obtaining word characteristics according to the word list model and the training corpus, wherein the word characteristics comprise word cohesion, word freedom and/or user word use habit; determining a new word according to the word characteristic; and adding the new words into the word segmentation word list.
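The text above names word cohesion and word freedom as features but does not define them. As a hedged sketch, one commonly used pair of definitions is pointwise mutual information for cohesion and the entropy of neighbouring characters for freedom. The function names, the normalisation by total character count, and the boundary markers below are assumptions of this sketch, not definitions taken from the source.

```python
import math
from collections import Counter


def ngram_counts(corpus, max_n):
    """Count every substring of length 1..max_n in the corpus."""
    counts = Counter()
    for text in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def cohesion(word, counts, total):
    """Minimum PMI over all two-way splits of `word` (needs len(word) >= 2 and count > 0)."""
    p_word = counts[word] / total
    return min(
        math.log(p_word / ((counts[word[:k]] / total) * (counts[word[k:]] / total)))
        for k in range(1, len(word))
    )


def entropy(counter):
    """Shannon entropy (natural log) of a frequency table."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0


def freedom(word, corpus):
    """Smaller of the left-neighbour and right-neighbour entropies of `word`."""
    left, right = Counter(), Counter()
    for text in corpus:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            # "<s>" and "</s>" are assumed markers for the start and end of a text.
            left[text[start - 1] if start > 0 else "<s>"] += 1
            right[text[end] if end < len(text) else "</s>"] += 1
            start = text.find(word, start + 1)
    return min(entropy(left), entropy(right))


corpus = ["xabz", "yabw", "abq"]
counts = ngram_counts(corpus, max_n=2)
total_chars = sum(len(t) for t in corpus)
print(cohesion("ab", counts, total_chars), freedom("ab", corpus))
```

Under these definitions, a candidate whose cohesion and freedom both exceed chosen thresholds would be treated as a new word; the thresholds themselves would have to be tuned and are not given in the text.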
In the present invention, the word segmentation table may be a point of interest word segmentation table.
A semantic retrieval device comprises a first acquisition module, a first word segmentation module and a search module, wherein the first acquisition module is used for acquiring a word segmentation word list and a text input by a user; the first word segmentation module is used for segmenting the text according to the word segmentation word list and determining a word segmentation result; and the search module is used for searching the text according to the word segmentation result.
The device further comprises a word segmentation word list generation module, which comprises a second acquisition module, a second word segmentation module and a determination module, wherein the second acquisition module is used for acquiring a word list model and a training corpus; the second word segmentation module is used for segmenting the training corpus according to the word list model and determining a word segmentation result; and the determination module is used for determining the word segmentation word list according to the word segmentation result.
Compared with the prior art, the invention has the following beneficial effects:
firstly, a word segmentation word list is generated aiming at a certain interest point of a user, and then words are segmented according to the word segmentation word list, so that the accuracy of searching products is improved;
and secondly, because the word segmentation model carries out word segmentation on the text input by the user according to the word segmentation word list, the manual operation is less and the time is short.
[ description of the drawings ]
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the invention, and a person skilled in the art can apply the invention to other similar scenarios according to these drawings without inventive effort. Unless otherwise apparent from the context, like numerals in the figures represent like structures and operations.
FIG. 1 is a schematic diagram of a network environment of a semantic retrieval system according to some embodiments of the present application;
FIG. 2 shows the structure of a computer that may implement certain systems disclosed herein;
FIG. 3 shows the structure of a mobile device that may implement certain systems disclosed herein;
FIG. 4 is an exemplary flow diagram of a semantic retrieval method according to some embodiments of the present application;
FIG. 5 is a block diagram of a semantic retrieval device according to some embodiments of the present application;
FIG. 6 is an exemplary flow chart of a method of obtaining a word segmentation vocabulary according to some embodiments of the present application; and
FIG. 7 is a block diagram of a word segmentation vocabulary device according to some embodiments of the present application.
[ detailed description ]
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures and examples are described in detail below.
As used in this application and in the claims, the singular forms "a," "an," and "the" do not refer exclusively to the singular and may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or apparatus may include other steps or elements.
Although various references are made herein to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a client and/or server. The modules are merely illustrative and different aspects of the systems and methods may use different modules.
Flow charts are used herein to illustrate operations performed by a system according to embodiments of the present application. It should be understood that the operations are not necessarily performed in the exact order shown. Rather, various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or one or more steps may be removed from the processes.
Embodiments of the present application may be applied to different transportation systems including, but not limited to, one or a combination of land, sea, air, and space transportation. Examples include taxis, special cars, hitch rides, buses, trains, railcars, subways, ships, airplanes, airships, hot air balloons, unmanned vehicles, and express pickup and delivery services that employ managed and/or distributed transportation systems. The application scenarios of the different embodiments of the present application include, but are not limited to, one or a combination of several of a web page, a browser plug-in, a client, a customization system, an intra-enterprise analysis system, an artificial intelligence robot, and the like. It should be understood that the application scenarios of the system and method of the present application are merely examples or embodiments of the present application, and those skilled in the art can also apply the present application to other similar scenarios, such as other similar service order systems, without inventive effort based on these drawings.
The terms "passenger," "customer," "demander," "service requester," "consumer," "service user," and the like, as used herein, are interchangeable and refer to a party that needs or orders a service, which may be an individual or a tool. Similarly, "driver," "provider," "supplier," "service provider," "server," and the like, as used herein, are also interchangeable and refer to an individual, tool, or other entity that provides a service or assists in providing a service. In addition, a "user" as described herein may be a party that needs or orders a service, or a party that provides or assists in providing a service.
Fig. 1 illustrates a schematic diagram of a network environment 100, according to some embodiments of the present application. The network environment 100 can include a word segmentation vocabulary device 105, one or more passenger-side devices 120, one or more databases 130, one or more driver-side devices 140, one or more networks 150, and one or more information sources 160. The word segmentation vocabulary device 105 may comprise a point of interest (POI) engine 110. In some embodiments, the POI engine 110 may be a system that analyzes and processes collected information to generate an analysis result. For example, the POI engine 110 may collect training corpora and vocabulary models and build a point of interest word segmentation vocabulary applicable to the field of mobile travel. The POI engine 110 may be a single server or a server group whose servers are connected via a wired or wireless network. A server group may be centralized, such as a data center, or distributed, such as a distributed system.
The passenger side 120 and the driver side 140 may be collectively referred to as users. A user may be an individual, a tool, or another entity directly associated with a service order, such as a requester or a provider of the service order. The passenger may be a service demander. Herein, "passenger," "passenger side," and "passenger-side device" may be used interchangeably. The passenger may also include a user of the passenger-side device 120. In some embodiments, the user may not be the passenger himself or herself. For example, user A of the passenger-side device 120 may use the passenger-side device 120 to request a trip for passenger B, or to receive the trip or other information or instructions sent by the word segmentation vocabulary device 105. For simplicity, the user of the passenger-side device 120 may also be referred to herein simply as a passenger. The driver may be a service provider. Herein, "driver," "driver side," and "driver-side device" may be used interchangeably. The driver may also include a user of the driver-side device 140. In some embodiments, the user may not be the driver himself or herself. For example, user C of the driver-side device 140 may use the driver-side device 140 to receive, on behalf of driver D, a trip or other information or instructions sent by the word segmentation vocabulary device 105. For simplicity, the user of the driver-side device 140 may also be referred to herein simply as a driver. In some embodiments, the passenger side 120 may include one or a combination of a desktop computer 120-1, a laptop computer 120-2, a built-in device 120-3 of a vehicle, a mobile device 120-4, and the like.
Further, the built-in device 120-3 of the vehicle may be a vehicle-mounted computer (carputer); the mobile device 120-4 may be one or more of a smart phone, a Personal Digital Assistant (PDA), a tablet computer, a handheld game console, smart glasses, a smart watch, a wearable device, a virtual reality device, or an augmented reality device (such as Google Glass, Oculus Rift, HoloLens, Gear VR), etc. The driver side 140 may also include one or more similar devices.
The POI engine 110 may directly access data stored in the database 130, or may access information at the user terminals 120/140 through the network 150. In some embodiments, the database 130 may generally refer to a device having storage capabilities. The database 130 is primarily used to store data collected from the passenger side 120 and/or the driver side 140, as well as various data utilized, generated, and output by the POI engine 110 in its operation. The database 130 may be local or remote. The connection or communication between the database 130 and the word segmentation vocabulary device 105 or a portion thereof (e.g., the POI engine 110) may be wired or wireless.
The network 150 may be a single network or a combination of multiple different networks. For example, the network 150 may be a Local Area Network (LAN), Wide Area Network (WAN), public network, private network, Public Switched Telephone Network (PSTN), the Internet, a wireless network, a virtual network, or any combination thereof. The network 150 may also include a plurality of network access points, e.g., wired or wireless access points such as base station 150-1, base station 150-2, and Internet exchange points, through which any data source may access the network 150 and transmit information through the network 150. For ease of understanding, the driver side 140 in a transportation service is taken as an example, but the application is not limited to the scope of this embodiment. For example, the driver-side device 140 may be a mobile phone or a tablet computer, and the network environment 100 of the driver-side device 140 may be classified as a wireless network (Bluetooth, Wireless Local Area Network (WLAN), Wi-Fi, etc.), a mobile network (2G, 3G, 4G signal, etc.), or another connection method (virtual private network (VPN), shared network, Near Field Communication (NFC), ZigBee, etc.).
Information source 160 is a source that provides other information to the system. Information sources 160 may be used to provide information related to services for the system, such as vocabulary models, corpora, and/or user-entered text, among others. The information source 160 may be in the form of a single central server, or may be in the form of a plurality of servers connected via a network, or may be in the form of a large number of personal devices. When the information source exists in the form of a large number of personal devices, these devices can upload text, sound, images, videos, and the like to the cloud server in a user-generated content manner, so that the cloud server forms the information source together with the personal devices connected thereto.
In some embodiments, the communication between the word segmentation vocabulary device 105 and different parts of the network environment 100 may take the form of orders. The object of an order may be any product. In some embodiments, the product may be a tangible product or an intangible product. A tangible product may be any object having a physical form, such as one or a combination of food, medicine, daily necessities, chemical products, appliances, clothing, automobiles, real estate, luxury goods, and the like. An intangible product may include one or a combination of service products, financial products, intellectual products, Internet products, etc. An Internet product may be any product that meets a person's needs for information, entertainment, communication, or commerce. Internet products can be classified in many ways. Taking classification by carrier platform as an example, the Internet product may include one or a combination of several of a personal host product, a Web product, a mobile Internet product, a commercial host platform product, an embedded product, and the like. The mobile Internet product may be software, a program, or a system used in a mobile terminal. The mobile terminal includes, but is not limited to, one or a combination of several of a notebook, a tablet computer, a mobile phone, a Personal Digital Assistant (PDA), an electronic watch, a POS machine, a vehicle-mounted computer, a television, and the like. Examples include the various social, shopping, travel, entertainment, learning, and investment software or applications used on computers or mobile phones. The travel software or application may be travel software, vehicle reservation software, map software, or other software or applications.
The vehicle reservation software or application may be used to reserve one or a combination of several of a horse, a carriage, a human-powered vehicle (e.g., bicycle, tricycle, etc.), an automobile (e.g., taxi, bus, etc.), a train, a subway, a ship, an aircraft (e.g., airplane, helicopter, space shuttle, rocket, hot air balloon, etc.), etc.
FIG. 2 depicts an architecture of a computer device that can be used to implement the particular system disclosed in this application. The particular system in this embodiment uses a functional block diagram to describe a hardware platform that includes a user interface. The computer may be a general purpose computer or a special purpose computer. Either kind of computer may be used to implement the particular system in this embodiment. Computer 200 may be used to implement any of the components described herein that provide the information required for mobile travel. For example, the POI engine 110 can be implemented by a computer such as computer 200 through its hardware devices, software programs, firmware, and combinations thereof. For convenience, only one computer is depicted in FIG. 2, but the computer functions described in the present embodiment for providing the information required for mobile travel can be implemented in a distributed manner by a set of similar platforms, distributing the processing load of the system.
Computer 200 includes a communication port 250 connected to a network to enable data communication. Computer 200 also includes a central processing unit (CPU), comprised of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 210, various forms of program storage units and data storage units, such as a hard disk 270, Read Only Memory (ROM) 230, and Random Access Memory (RAM) 240, which can store various data files used for computer processing and/or communication, as well as program instructions that may be executed by the CPU. The computer 200 also includes an input/output component 260 that supports the flow of input/output data between the computer and other components, such as a user interface 280. Computer 200 may also receive programs and data over a communications network.
The foregoing outlines various aspects of a method of providing the information required for mobile travel and/or of implementing other steps programmatically. Program portions of the technology may be thought of as "products" or "articles of manufacture" in the form of executable code and/or associated data, which are carried by or embodied in computer readable media. Tangible, non-transitory storage media include the memory or storage used by any computer, processor, or similar device or associated module, such as various semiconductor memories, tape drives, disk drives, or similar devices capable of providing storage functions for software at any time.
All or a portion of the software may sometimes communicate over a network, such as the Internet or another communication network. Such communication enables software to be loaded from one computer device or processor to another. For example, software may be loaded from a management server or host computer of a mobile travel system onto the hardware platform of a computer environment, onto another computer environment that implements the system, or onto a system with similar functions related to providing the information required for mobile travel. Therefore, another kind of medium capable of carrying the software elements, such as optical waves, electric waves, or electromagnetic waves propagated through electric cables, optical cables, or air, may also be used as a physical connection between local devices. The physical medium that carries such waves, such as an electric cable, a wireless link, or an optical cable, may also be considered a medium carrying the software. As used herein, unless limited to a tangible "storage" medium, other terms referring to a computer or machine "readable medium" refer to media that participate in the execution of any instructions by a processor.
Thus, a computer-readable medium may take many forms, including but not limited to a tangible storage medium, a carrier wave medium, or a physical transmission medium. Non-volatile storage media include optical or magnetic disks and other storage systems used in computers or similar devices that can implement the system components described in the figures. Volatile storage media include dynamic memory, such as the main memory of a computer platform. Tangible transmission media include coaxial cables, copper cables, and fiber optics, including the wires that form a bus within a computer system. Carrier wave transmission media may convey electrical, electromagnetic, acoustic, or light wave signals, which may be generated by radio frequency or infrared data communication methods. Common computer-readable media include hard disks, floppy disks, magnetic tape, and any other magnetic medium; CD-ROM, DVD-ROM, and any other optical medium; punch cards and any other physical storage medium containing a pattern of holes; RAM, PROM, EPROM, FLASH-EPROM, and any other memory chip or cartridge; a carrier wave transporting data or instructions, a cable or connection transporting the carrier wave, and any other medium from which a computer can read program code and/or data. Many of these forms of computer-readable media may be involved in carrying instructions to a processor for execution or in delivering one or more results.
FIG. 3 depicts an architecture of a mobile device that can be used to implement the particular system disclosed in this application. In this example, the user device for displaying and interacting with location-related information is a mobile device 300, including, but not limited to, a smart phone, a tablet, a music player, a portable game player, a Global Positioning System (GPS) receiver, a wearable computing device (e.g., glasses, watch, etc.), or other forms. The mobile device 300 in this example includes one or more Central Processing Units (CPUs) 340, one or more Graphics Processing Units (GPUs) 330, a display 320, a memory 360, an antenna 310 (e.g., a wireless communication unit), a storage unit 390, and one or more input/output (I/O) devices 350. Any other suitable components, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 300. As shown in FIG. 3, a mobile operating system 370, such as iOS, Android, or Windows Phone, and one or more applications 380 may be loaded into the memory 360 from the storage unit 390 and executed by the central processing unit 340. The applications 380 may include a browser or other mobile application suitable for receiving and processing location-related information on the mobile device 300. Text input by the user may be obtained via the input/output device 350 and provided to the POI engine 110 and/or other components of the system 100, for example through the network 150.
To implement the various modules, units, and their functionality described in the foregoing disclosure, a computer hardware platform may be used as a hardware platform for one or more of the elements described above (e.g., the POI engine 110 and/or other components of the system 100 described in FIGS. 1-7). The hardware elements, operating systems, and programming languages of such computers are commonplace, and it is assumed that one skilled in the art is sufficiently familiar with such techniques to use the techniques described herein to provide the information required for mobile travel. A computer containing user interface elements can be used as a Personal Computer (PC) or other type of workstation or terminal device and, when suitably programmed, can also be used as a server. The structure, programs, and general operation of such computer devices are considered familiar to those skilled in the art, so the figures require no additional explanation.
FIG. 4 is an exemplary flow diagram of a semantic retrieval method according to some embodiments of the present application.
In step 410, a word segmentation vocabulary may be obtained. In some embodiments, the word segmentation vocabulary may be obtained from the word segmentation vocabulary device 105, the database 130, the network 150, or the information source 160. In one embodiment, the word segmentation vocabulary may be a collection of segmented words related to one or more specific fields. For example, the specific field may be transportation, dining, travel, medical care, shopping, etc. In a specific embodiment, the word segmentation vocabulary may be a point of interest word segmentation vocabulary. The point of interest word segmentation vocabulary may be related to a user's travel. In some embodiments, the point of interest word segmentation vocabulary contains minimum semantic units and/or maximum semantic units.
In step 420, the text input by the user may be obtained. In some embodiments, a user may enter text through the input/output component 260. For example, a user may enter text through a web page or application software. As another example, a user may enter information through a physical interface. The user may input text by handwriting, mouse operation, touch screen operation, key operation, voice control, gesture operation, eye movement operation, and the like. The input content can be one or a combination of several of numbers, texts, sounds, images, videos, vibrations, and the like. The text may be one or more sentences, one or more phrases, one or more words, or the like. In some embodiments, the text may be related to one or more products.
In some embodiments, the text entered by the user may be obtained by image recognition and/or voice recognition. For example, a text image is obtained by a camera of a mobile phone of a user, and an input text is obtained by text recognition. For another example, the user's voice input is obtained through the user's mobile phone microphone and recognized as text through voice recognition.
In step 430, the text input by the user may be segmented to obtain a first word segmentation result. In some embodiments, the input text may be segmented according to a point of interest word segmentation vocabulary. The word segmentation mode can comprise coarse-grained word segmentation, fine-grained word segmentation, or any combination thereof. Fine-grained word segmentation means that an original sentence is segmented into its most basic words. Coarse-grained word segmentation means that several basic words in an original sentence are merged into one word that forms an entity with relatively definite semantics. In some embodiments, fine-grained word segmentation can be performed on a text input by a user according to the minimum semantic units, so that a fine-grained word segmentation result is obtained. In some embodiments, one or more fine-grained word segmentation results may be merged according to the maximum semantic units to obtain a coarse-grained word segmentation result. For example, the original string "Zhejiang university sits beside the west lake" is subjected to fine-grained word segmentation to obtain the fine-grained result "Zhejiang/university/sit/west lake/side", and the fine-grained results are then merged to obtain the coarse-grained result "Zhejiang university/sit/west lake/side".
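As a non-limiting illustration, the two word segmentation modes of step 430 might be sketched as follows. Forward maximum matching is assumed here as the segmentation algorithm, since the text above does not name one, and the vocabularies `MIN_UNITS` and `MAX_UNITS` are hypothetical. An unspaced lowercase string stands in for text without word delimiters.

```python
def fine_grained(text, min_units):
    """Split `text` into minimum semantic units by forward maximum matching."""
    max_len = max((len(u) for u in min_units), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Unknown characters are emitted one at a time.
            if size == 1 or piece in min_units:
                tokens.append(piece)
                i += size
                break
    return tokens


def coarse_grained(fine_tokens, max_units):
    """Merge runs of adjacent fine-grained tokens that form a maximum semantic unit."""
    merged, i = [], 0
    while i < len(fine_tokens):
        # Try the longest run first; a single token is always accepted.
        for j in range(len(fine_tokens), i, -1):
            candidate = "".join(fine_tokens[i:j])
            if j - i == 1 or candidate in max_units:
                merged.append(candidate)
                i = j
                break
    return merged


MIN_UNITS = {"zhejiang", "university", "sit", "westlake", "side"}  # hypothetical
MAX_UNITS = {"zhejianguniversity"}                                 # hypothetical

fine = fine_grained("zhejianguniversitysitwestlakeside", MIN_UNITS)
coarse = coarse_grained(fine, MAX_UNITS)
print(fine)    # ['zhejiang', 'university', 'sit', 'westlake', 'side']
print(coarse)  # ['zhejianguniversity', 'sit', 'westlake', 'side']
```

The coarse-grained result is derived from the fine-grained one, mirroring the merge step in the example above, so both results can be kept in the first word segmentation result.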
In some embodiments, the first segmentation result may further include a combination of a plurality of words having a probability of occurring at the same time greater than a set threshold. For example, in a certain time range (for example, 3h), one or more user-input texts including the number and/or the dividend are collected, the ratio of the number of the texts input by the user, which simultaneously include the number and the dividend, to the total number of the texts input by the user in the certain time range is calculated, and the probability that the number and the dividend simultaneously appear is obtained. When the probability of the simultaneous occurrence of the number and the divination is higher than 70%, the combination of the number and the divination is the word segmentation result of the number divination. In some embodiments, when two or more words exist in the user log and the interest point word segmentation table at the same time, the two or more words are word segmentation results. Under different application scenes, different word segmentation modes can be adopted. For example, in a scenario where the user selects an accurate search, the segmentation may be performed in a coarse-grained segmentation manner.
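As a non-limiting illustration, the co-occurrence test described above might be sketched as follows. The texts are assumed to have been collected within the time range already, substring matching is assumed as the test for whether a text contains a word, and the sample words "east" and "gate" are hypothetical; the 70% threshold follows the example in the text.

```python
def cooccurrence_probability(texts, word_a, word_b):
    """Share of the collected texts that contain both words."""
    if not texts:
        return 0.0
    both = sum(1 for t in texts if word_a in t and word_b in t)
    return both / len(texts)


def combined_words(texts, word_a, word_b, threshold=0.7):
    """Return the combined word when the two words co-occur often enough, else None."""
    if cooccurrence_probability(texts, word_a, word_b) > threshold:
        return f"{word_a} {word_b}"
    return None


# Hypothetical texts entered by users within the time range.
texts = ["east gate", "east gate station", "east", "east gate mall"]
print(cooccurrence_probability(texts, "east", "gate"))  # 0.75
print(combined_words(texts, "east", "gate"))            # east gate
```

Three of the four texts contain both words, so the ratio of 0.75 exceeds the threshold and the combination is added to the word segmentation result.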
In step 440, the text may be searched according to the first segmentation result. In some embodiments, the segmentation mode influences whether the user is interested in the products found. For example, fine-grained segmentation can distort the semantic expression of the text, so that many results that are literally similar but semantically only loosely related are returned, which lowers the user's interest in the products found. In some embodiments, the product may be a tangible product or an intangible product. A tangible product may be anything with a physical size or substance, such as one or a combination of food, medicine, daily necessities, chemical products, appliances, clothing, automobiles, real estate, luxury goods, and the like. An intangible product may include one or a combination of service products, financial products, intellectual products, internet products, and the like. An internet product may be any product that meets a person's needs for information, entertainment, communication, or commerce. Internet products can be classified in many ways. Classified by carrying platform, for example, internet products may include one or a combination of personal host products, Web products, mobile internet products, commercial host platform products, embedded products, and the like. A mobile internet product may be software, a program, or a system used on a mobile terminal. In some embodiments, the product may also be a digital good, that is, a good stored in a digitized format, such as a database, software, an audio work, a stock index, or an electronic periodical.
The invention also provides a device whose modules correspond one-to-one to the steps of the method.
FIG. 5 is a block diagram of a semantic retrieval device according to some embodiments of the present application. All or part of the functional modules in the device can run on the terminal processing equipment.
The semantic retrieval device may include a first obtaining module 510, a first segmentation module 520, and a search module 530. The connection between the modules may be wired, wireless, or a combination of both. Any one of the modules may be local, remote, or a combination of the two. The correspondence between the modules may be one-to-one, or one-to-many.
The first obtaining module 510 may obtain data. In some embodiments, the first obtaining module 510 may obtain data from the passenger-side device 120 and/or the driver-side device 140. In some embodiments, the first obtaining module 510 may obtain data from the word segmentation word list device 105, the storage device 130, the network 150, and/or the information source 160. The data obtained by the first obtaining module 510 may include a point of interest word segmentation word list, a text input by a user, and the like. In some embodiments, the data obtained by the first obtaining module 510 may be sent to the first segmentation module 520 and/or the search module 530. For example, the first obtaining module 510 obtains a point of interest word segmentation word list, and the first segmentation module 520 segments the text according to that word list to obtain a segmentation result.
The first segmentation module 520 may obtain a segmentation result. In some embodiments, the first segmentation module 520 may segment a text to obtain segmented words. In some embodiments, the text may be user-entered text. The first segmentation module 520 may implement segmentation according to a hidden Markov model, a probabilistic language model, a Chinese segmentation disambiguation model, or a combination thereof. In some embodiments, the hidden Markov models may include weighted graph models and sequence labeling models.
In some embodiments, the first segmentation module 520 performs segmentation on the input text according to the point of interest segmentation word list to obtain a first segmentation result. In some embodiments, the first segmentation results obtained by the first segmentation module 520 may be sent to the search module 530 for searching for a product of interest.
The search module 530 may search the input text according to the first segmentation result. The segmentation mode may affect the user's level of interest in the products found. For example, with fine-grained segmentation, the first segmentation result may distort the semantic expression of the input text, so that many results that are literally similar but semantically unrelated are returned, which lowers the user's interest in the products found. In some embodiments, the products found by the search module 530 may be output via the input/output component 260. The information output by the input/output component 260 may be one or a combination of numbers, text, sound, images, video, vibrations, and the like.
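The flow through the three modules, and the effect of segmentation granularity on the results, can be sketched as follows. This is an illustrative toy under stated assumptions: the function names, the catalog entries, and the rule "an item matches if it contains any token" are hypothetical and are not taken from the application.

```python
from typing import List, Set


def segment(text: str, vocabulary: Set[str], max_len: int = 3) -> List[str]:
    """Role of the first segmentation module 520: greedy longest match of
    whitespace-separated words against the word list."""
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += n
                break
    return tokens


def search(tokens: List[str], catalog: List[str]) -> List[str]:
    """Role of the search module 530: return items containing any token."""
    lowered = [t.lower() for t in tokens]
    return [item for item in catalog if any(t in item.lower() for t in lowered)]


def retrieve(text: str, vocabulary: Set[str], catalog: List[str]) -> List[str]:
    """Role of the first obtaining module 510 is played by the arguments:
    the word list and the user text are handed to segmentation and search."""
    return search(segment(text, vocabulary), catalog)


# Hypothetical searchable items.
catalog = ["Zhejiang university east gate", "Zhejiang museum", "west lake university"]

coarse_hits = retrieve("Zhejiang university", {"Zhejiang university"}, catalog)
fine_hits = retrieve("Zhejiang university", set(), catalog)
print(coarse_hits)
print(fine_hits)
```

With the coarse-grained unit in the word list, only the item about Zhejiang university is returned; with an empty word list the query falls apart into two tokens, and literally similar but unrelated items are returned as well.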
FIG. 6 is an exemplary flowchart of a method for obtaining a word segmentation word list according to some embodiments of the present application.
In step 610, a vocabulary model may be obtained. The vocabulary model (also referred to as a "segmentation dictionary") is a collection of segmented words that is not limited to any application domain. The vocabulary model may be drawn from and applied to various fields, such as search engines, shopping websites, and mobile travel. In some embodiments, the vocabulary model may be composed of several domain-specific subsets; for example, when the domain of interest is taxi hailing, the subset is a word segmentation word list related to taxi hailing (also referred to as a "point of interest word segmentation word list"). In some embodiments, the segmented words may be generated by segmenting a corpus with a segmentation model. In some embodiments, the usage frequency of the segmented words may also be counted. The higher the usage frequency, the more important the segmented word. When the usage frequency is greater than or equal to a filtering threshold, the segmented word may be added to the segmentation dictionary.
In step 620, a training corpus may be obtained. In some embodiments, the training corpus may be text historically input by users. For example, a user enters the text "hitch ride to Zhejiang university". The training corpus may be prepared based on a data smoothing model and/or a corpus expansion model. The data smoothing model may include the Laplace algorithm, the Good-Turing algorithm, absolute discounting and linear discounting algorithms, the Witten-Bell algorithm, and the like. The corpus expansion model may include synonym expansion and/or part-of-speech expansion.
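Of the smoothing algorithms listed, Laplace (add-one) smoothing is the simplest to show. The sketch below applies it to unigram counts, P(w) = (count(w) + 1) / (N + V), where N is the number of tokens and V the vocabulary size. How the application actually uses smoothing is not specified, so the function name, the sample tokens, and the choice of a unigram model are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def laplace_probabilities(tokens: Iterable[str], vocabulary: Set[str]) -> Dict[str, float]:
    """Add-one smoothed unigram probabilities over the given vocabulary.
    Words that never occur in the tokens still get a non-zero probability."""
    counts = Counter(tokens)
    vocab = set(vocabulary) | set(counts)
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}


# Hypothetical corpus tokens; "bus" is in the vocabulary but unseen.
probs = laplace_probabilities(["taxi", "taxi", "hitch"], {"taxi", "hitch", "bus"})
print(probs)
```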
In step 630, the training corpus may be segmented to obtain a second segmentation result. In some embodiments, the training corpus may be preliminarily segmented according to the segmentation dictionary to determine preliminary words, that is, the second segmentation result.
In step 640, a word segmentation word list may be determined based on the second segmentation result. In one embodiment, the word segmentation word list may be a point of interest word segmentation word list, that is, a collection of segmented words related to one or more specific fields. For example, the point of interest word segmentation word list may be associated with a user's mobile travel. The point of interest word segmentation word list may further include minimum semantic units and/or maximum semantic units. In some embodiments, the point of interest word segmentation word list may be determined through multiple rounds of iteration. In each round, when the second segmentation result meets a preset rule, it is used as the new vocabulary model for the next round of segmentation. The result obtained when the iteration ends is the point of interest word segmentation word list. The preset rule may include, for example, that the usage frequency of a word, or its relevance to the travel field, is higher than a certain threshold.
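The multi-round iteration can be sketched as follows. The description does not say how candidate words arise in each round, so this sketch makes a loud assumption: candidates are adjacent pairs of tokens from the current segmentation, and the preset rule is a plain frequency threshold. The function names and the corpus are hypothetical.

```python
from collections import Counter
from typing import List, Set


def segment(words: List[str], vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedy longest match of word runs against the current vocabulary model."""
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in vocab:
                out.append(candidate)
                i += n
                break
    return out


def build_word_list(corpus: List[str], vocab: Set[str],
                    threshold: int, max_rounds: int = 5) -> Set[str]:
    """Each round: segment the corpus, count candidate words (assumed here to
    be adjacent token pairs), add those meeting the rule, then re-segment."""
    vocab = set(vocab)
    for _ in range(max_rounds):
        counts: Counter = Counter()
        for text in corpus:
            tokens = segment(text.split(), vocab)
            counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
        new_words = {w for w, c in counts.items() if c >= threshold and w not in vocab}
        if not new_words:  # nothing meets the preset rule: iteration ends
            break
        vocab |= new_words
    return vocab


corpus = [
    "to Zhejiang university",
    "Zhejiang university gate",
    "Zhejiang university",
    "to west lake",
]
word_list = build_word_list(corpus, set(), threshold=3)
print(word_list)
```

In this toy corpus the first round promotes "Zhejiang university" (seen three times), and the second round finds no further candidate that meets the threshold, so the iteration stops.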
The usage frequency of a word is related to its importance. In general, the higher the usage frequency, the more important the word. When the usage frequency of a word is greater than or equal to a filtering threshold, the word meets the preset rule and may be added to the point of interest word segmentation word list. In some embodiments, the usage frequency may refer to the number of times a word occurs within a certain time window (e.g., within 3 hours). In some embodiments, the usage frequency may also refer to the number of times a word appears among all the segmented words obtained after segmentation. For example, the number of occurrences of "taxi" among all the segmented words is counted. The filtering threshold may be predetermined. For example, with the filtering threshold set to 10, if the word "taxi" occurs 12 times within 24 h, "taxi" is included in the point of interest word segmentation word list. As another example, with the filtering threshold set to 10, if the word "call car" occurs 9 times within 24 h, "call car" is not included in the point of interest word segmentation word list.
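The filtering rule in the example above (threshold 10, "taxi" seen 12 times, "call car" seen 9 times) can be sketched as follows. The function name and the way the token stream is built are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Set


def filter_by_frequency(segmented_words: Iterable[str], threshold: int) -> Set[str]:
    """Keep the words whose count in the window is >= the filtering threshold."""
    counts = Counter(segmented_words)
    return {word for word, count in counts.items() if count >= threshold}


# Segmented words observed within one 24 h window (hypothetical stream).
observed = ["taxi"] * 12 + ["call car"] * 9
kept = filter_by_frequency(observed, threshold=10)
print(kept)
```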
In some embodiments, new words may also be added to the point of interest word segmentation word list. The new words may be generated based on the user log and/or word characteristics.
The user log may be information about the passenger or driver collected by the information source 160. For example, the user log may include terms entered by the user and/or digital goods clicked.
The word characteristics may include the degree of cohesion of a word, the degree of freedom of a word, and/or the habit characteristics of users using the word. The degree of cohesion means that, if two or more words can be combined into one word, they appear together in the corpus in the form of that word. The degree of freedom means that, if two or more words cannot be combined into one word, they may appear separately in different contexts. In some embodiments, the degree of cohesion and/or freedom is determined by judging whether two or more words in the text entered by users can form a word. For example, the user-input texts that include the two or more words within a certain time range (for example, 3 h) are collected, and the ratio of the number of texts in which the two or more words appear in the form of a word to the total number of texts input by users in that time range is calculated; this ratio is the probability that the two or more words appear in the form of a word. When the probability is greater than or equal to a certain threshold (e.g., 60%), the two or more words are judged to have the degree of cohesion of a word. When the probability is less than or equal to a certain threshold (e.g., 40%), the two or more words are judged to have the degree of freedom of a word. The habit characteristic means that the word commonly appears in the text input by users and in the retrieval results selected by users.
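The cohesion and freedom judgment can be sketched as follows, using the example thresholds of 60% and 40%. Treating "appears in the form of a word" as "the parts occur adjacently and in order" is an assumption of this sketch, as are the function names and the sample texts.

```python
from typing import List, Tuple


def word_form_ratio(texts: List[List[str]], parts: Tuple[str, ...]) -> float:
    """Share of all texts in which the parts occur adjacently, in order."""
    n = len(parts)

    def contains_unit(tokens: List[str]) -> bool:
        return any(tuple(tokens[i:i + n]) == parts
                   for i in range(len(tokens) - n + 1))

    if not texts:
        return 0.0
    return sum(1 for tokens in texts if contains_unit(tokens)) / len(texts)


def classify(ratio: float, cohesion_threshold: float = 0.6,
             freedom_threshold: float = 0.4) -> str:
    """Map the ratio to the judgment described in the text."""
    if ratio >= cohesion_threshold:
        return "cohesive"
    if ratio <= freedom_threshold:
        return "free"
    return "undetermined"


# Hypothetical tokenized user texts from one time window (e.g. 3 h).
window = [
    ["west", "lake"],
    ["west", "lake", "side"],
    ["go", "west", "lake"],
    ["to", "west", "lake"],
    ["west", "gate"],
]
lake_ratio = word_form_ratio(window, ("west", "lake"))
gate_ratio = word_form_ratio(window, ("west", "gate"))
print(classify(lake_ratio), classify(gate_ratio))
```

Ratios strictly between the two thresholds are left undetermined here because the description does not state how they are handled.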
It should be noted that the above description of the semantic retrieval method is only for convenience of description and should not limit the present application to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, having the benefit of the teachings of the present application, changes may be made to the method of semantic retrieval without departing from such principles. For example, steps may be added, subtracted, combined, or split. In some embodiments, step 620 is performed first, and then step 610 is performed. Such variations are within the scope of the present application.
The invention also provides a device whose modules correspond one-to-one to the steps of the method.
FIG. 7 is a block diagram of the word segmentation word list device 105 according to some embodiments of the present application. All or part of the functional modules in the device can run on processing equipment at the server side.
The word segmentation word list device 105 may include a second obtaining module 710, a second segmentation module 720, and a determination module 730. The connection between the modules may be wired, wireless, or a combination of both. Any one of the modules may be local, remote, or a combination of the two. The correspondence between the modules may be one-to-one, or one-to-many.
The second obtaining module 710 may obtain data. The second obtaining module 710 may obtain data from the passenger-side device 120, the driver-side device 140, the storage device 130, the network 150, and/or the information source 160. The data obtained by the second obtaining module 710 may include a vocabulary model and/or a training corpus, and the like. In some embodiments, the data obtained by the second obtaining module 710 may be sent to the second segmentation module 720. For example, the vocabulary model and/or the training corpus obtained by the second obtaining module 710 may be sent to the second segmentation module 720, and the second segmentation module 720 segments the training corpus according to the vocabulary model to obtain a second segmentation result.
The second segmentation module 720 may determine a second segmentation result. The second segmentation module 720 may segment the training corpus according to the vocabulary model. In some embodiments, the second segmentation result obtained by the second segmentation module 720 may be sent to the determination module 730 for determining the point of interest word segmentation word list.
The determination module 730 may determine information. The information may include a point of interest word segmentation word list. In some embodiments, the point of interest word segmentation word list may be related to a user's travel. In some embodiments, the information determined by the determination module 730 may be sent to the first obtaining module 510 for segmenting the text input by the user.
It should be noted that the above description of processing modules is merely a specific example and should not be considered the only possible embodiment. Each of the above modules or units is not necessary, each of the modules or units may be implemented by one or more components, and the function of each of the modules or units is not limited thereto. The modules or units can be selectively added or deleted according to specific implementation scenes or requirements. It will be clear to a person skilled in the art that, having the understanding of the basic principles of semantic search, it is possible to make various modifications and changes in form and detail of specific embodiments and steps of the processing module, and that several simple deductions or substitutions may be made, and that certain adjustments, combinations or divisions of the sequence of modules or units may be made without inventive effort, but these modifications and changes are still within the scope of the above description.
Moreover, those skilled in the art will appreciate that aspects of the present application may be illustrated and described in terms of several patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, various aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
A computer readable signal medium may contain a propagated data signal with computer program code embodied therein, for example, at baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable signal medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or the like, or any combination of the preceding.

Claims (12)

1. A method of semantic retrieval, the method comprising:
acquiring a word segmentation word list;
acquiring a text input by a user;
performing word segmentation on the text according to a word segmentation word list, and determining a first word segmentation result;
and searching the text according to the first word segmentation result.
2. The method for semantic retrieval of claim 1, wherein the first segmentation result comprises a fine-grained segmentation result and a coarse-grained segmentation result.
3. The method for semantic retrieval of claim 1, wherein the first segmentation result comprises a combination of a plurality of words having a probability of occurring simultaneously greater than a set threshold.
4. The method for semantic retrieval of claim 1, wherein the method for generating the participle vocabulary comprises:
acquiring a word list model;
acquiring a training corpus;
performing word segmentation on the training corpus according to the word list model to obtain a second word segmentation result;
and determining a word segmentation word list according to the second word segmentation result.
5. The method of obtaining a vocabulary of words in claim 4, wherein said determining a vocabulary of words based on the second result of words segmentation further comprises: and determining a word segmentation word list in a multi-round iteration mode.
6. The method of obtaining a thesaurus as claimed in claim 4, wherein each iteration of said plurality of iterations further comprises:
performing preliminary word segmentation on the training corpus according to the word list model, and determining a preliminary word;
acquiring a preset rule;
judging whether the preliminary words meet preset rules or not;
and when the initial word is in accordance with a preset rule, adding the initial word into the word list model, generating a new word list model, and performing the next word segmentation.
7. The method of claim 1, wherein the method of generating a vocabulary of words comprises:
acquiring a user log, wherein the user log comprises search words input by a user or search results selected by the user;
determining new words according to the user log;
and adding the new words into the word segmentation word list.
8. The method of claim 1, wherein the method of generating a vocabulary of words comprises:
acquiring word characteristics, wherein the word characteristics comprise the degree of cohesion of words, the degree of freedom of words and/or the habit characteristics of words used by a user;
determining a new word according to the word characteristic;
and adding the new words into the word segmentation word list.
9. The method of claim 1, wherein the word segmentation word list is a point of interest word segmentation word list.
10. A semantic retrieval apparatus comprising:
the first acquisition module is used for acquiring a word segmentation word list and a text input by a user;
a first word segmentation module for segmenting the text according to the word segmentation word list and determining the segmentation result; and
the searching module is used for searching the text according to the word segmentation result.
11. The semantic retrieval device of claim 10 further comprising a participle vocabulary generation module comprising:
the second acquisition module is used for acquiring a word list model and acquiring training corpora;
a second word segmentation module for segmenting the training corpus according to the vocabulary model and determining the segmentation result; and
the determining module is used for determining the word segmentation word list according to the word segmentation result.
12. A computer-readable storage medium, wherein the storage medium stores computer instructions, and when the computer instructions in the storage medium are read by a computer, the computer performs the method of any one of claims 1-9.
CN201810554080.3A 2018-06-01 2018-06-01 Semantic retrieval method and device Pending CN110633352A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201810554080.3A CN110633352A (en) 2018-06-01 2018-06-01 Semantic retrieval method and device
PCT/CN2019/081444 WO2019228065A1 (en) 2018-06-01 2019-04-04 Systems and methods for processing queries
US17/093,664 US20210089531A1 (en) 2018-06-01 2020-11-10 Systems and methods for processing queries

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810554080.3A CN110633352A (en) 2018-06-01 2018-06-01 Semantic retrieval method and device

Publications (1)

Publication Number Publication Date
CN110633352A true CN110633352A (en) 2019-12-31

Family

ID=68966189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810554080.3A Pending CN110633352A (en) 2018-06-01 2018-06-01 Semantic retrieval method and device

Country Status (1)

Country Link
CN (1) CN110633352A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942188A (en) * 2013-01-22 2014-07-23 腾讯科技(深圳)有限公司 Method and device for identifying corpus languages
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
CN106874492A (en) * 2017-02-23 2017-06-20 北京京东尚科信息技术有限公司 Searching method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444716A (en) * 2020-03-30 2020-07-24 深圳市微购科技有限公司 Title word segmentation method, terminal and computer readable storage medium
CN111611450A (en) * 2020-05-12 2020-09-01 深圳力维智联技术有限公司 Cross-media data fusion method and device and storage medium
CN111611450B (en) * 2020-05-12 2023-06-13 深圳力维智联技术有限公司 Cross-media data fusion method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191231