CN112687403B - Medicine dictionary generation and medicine search method and device - Google Patents

Medicine dictionary generation and medicine search method and device

Info

Publication number
CN112687403B
Authority
CN
China
Prior art keywords
medicine
dictionary
text sequence
drug
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110025121.1A
Other languages
Chinese (zh)
Other versions
CN112687403A (en
Inventor
张敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rajax Network Technology Co Ltd
Original Assignee
Rajax Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rajax Network Technology Co Ltd filed Critical Rajax Network Technology Co Ltd
Priority to CN202110025121.1A priority Critical patent/CN112687403B/en
Publication of CN112687403A publication Critical patent/CN112687403A/en
Application granted granted Critical
Publication of CN112687403B publication Critical patent/CN112687403B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The application discloses a medicine dictionary generation method and a medicine search method. The medicine dictionary generation method comprises: acquiring a medicine text sequence of a medicine object to be searched, and generating subsequences of the medicine text sequence; generating a candidate medicine dictionary based on the medicine text sequence and its subsequences; and generating a target medicine dictionary according to the candidate medicine dictionary and pre-collected medicine corpus data. The medicine search method comprises: acquiring search information of a user; querying a preset medicine dictionary with the search information to obtain a medicine text sequence corresponding to the search information; and acquiring information of at least one medicine object corresponding to the medicine text sequence; wherein the preset medicine dictionary is the target medicine dictionary described above. The method addresses the problem of low accuracy in medicine search.

Description

Medicine dictionary generation and medicine search method and device
Technical Field
The application relates to the technical field of data processing, in particular to a method, a device and equipment for generating a medicine dictionary. The application also relates to a medicine searching method and device.
Background
With the development of the internet, medicine search scenarios are becoming increasingly important. In practice, medicine search has a distinctive characteristic: medicine names are generally long and complicated, so users often enter only part of a name. For example, a user who searches for "coral tinea" expects the medicine "coral ringworm", and a user who searches for "Vena" expects venlafaxine.
Existing search schemes are generally based on either a word-segmentation index or a single-word index. With a word-segmentation index, the segmentation data is generally mined from the context of users' search content, so it depends heavily on user search volume, and the correct medicine is difficult to recall when only part of a medicine name is entered; often no result is recalled at all. For example, if the name of the drug is "compound ketoconazole" and the word segmentation result is "compound" + "ketoconazole", then a user who searches for "compound ketone" or "compound ketoconazole" cannot recall the drug from the index. A single-word index, on the other hand, greatly degrades performance, so search efficiency is low.
Therefore, how to improve the accuracy of medicine search while ensuring search performance is a problem to be solved.
Disclosure of Invention
The medicine dictionary generating method and the medicine searching method provided by the embodiment of the application generate a more accurate medicine dictionary, solve the problem of low accuracy of medicine searching and ensure searching performance.
The embodiment of the application provides a method for generating a medicine dictionary, which comprises the following steps: acquiring a medicine text sequence of a medicine object to be searched, and generating a subsequence of the medicine text sequence; generating a candidate drug dictionary based on the drug text sequence and the subsequence of the drug text sequence; generating a target medicine dictionary according to the candidate medicine dictionary and the medicine corpus data collected in advance; the target drug dictionary is used for searching target drug objects from drug objects to be searched.
Optionally, the generating a subsequence of the text sequence of the medicine includes: generating a subsequence of the drug text sequence based on elements in the drug text sequence and subsequent elements of the elements.
Optionally, the method further includes: determining a step size for generating the subsequence; taking the step size as the length of the subsequence; and aiming at the elements in the medicine text sequence, determining one or more elements which are adjacent to the right of the elements according to the length as the subsequent elements of the elements.
Optionally, the generating a candidate drug dictionary based on the drug text sequence and the subsequence of the drug text sequence includes: establishing a corresponding relation between the medicine text sequence and each subsequence of the medicine text sequence; and generating entry data according to each subsequence and the medicine text sequence corresponding to the subsequence, and generating the candidate medicine dictionary by using the entry data.
Optionally, the generating a target drug dictionary according to the candidate drug dictionary and pre-collected drug corpus data includes: generating the target drug dictionary according to the occurrence frequency and/or the number of occurrences, in the pre-collected drug corpus data, of the entry data in the candidate drug dictionary.
Optionally, the generating a target drug dictionary according to the occurrence frequency and/or the number of occurrences of the entry data in the candidate drug dictionary in the pre-collected drug corpus data includes: determining the occurrence frequency and/or the number of occurrences, in the drug corpus data, of the subsequence in an entry of the candidate drug dictionary as a first word frequency; determining the occurrence frequency and/or the number of occurrences, in the drug corpus data, of the drug text sequence corresponding to that subsequence in the entry as a second word frequency; and if the second word frequency is greater than a preset text validity judgment threshold and the ratio of the first word frequency to the second word frequency meets a preset entry validity judgment condition, adding the entry data to the target drug dictionary.
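The screening rule above can be illustrated with a minimal sketch. The function names, the toy corpus, the threshold value and the form of the ratio condition are all assumptions made for illustration: the text only states that the ratio must meet "a preset entry validity judgment condition", so the condition is passed in as a caller-supplied predicate, and raw occurrence counts stand in for the word frequencies.

```python
from typing import Callable, Iterable


def count_occurrences(text: str, corpus: Iterable[str]) -> int:
    """Count (non-overlapping) occurrences of text across all corpus documents."""
    return sum(doc.count(text) for doc in corpus)


def build_target_dictionary(
    candidate_entries: Iterable[tuple[str, str]],
    corpus: Iterable[str],
    text_threshold: float,
    ratio_ok: Callable[[float], bool],
) -> list[tuple[str, str]]:
    """Keep entries (subsequence, drug name) that pass both validity checks.

    text_threshold and ratio_ok are hypothetical parameters standing in for
    the preset threshold and the preset entry validity judgment condition.
    """
    corpus = list(corpus)
    target = []
    for subsequence, name in candidate_entries:
        first = count_occurrences(subsequence, corpus)   # first word frequency
        second = count_occurrences(name, corpus)         # second word frequency
        if second > text_threshold and second > 0 and ratio_ok(first / second):
            target.append((subsequence, name))
    return target


# Toy corpus and entries, invented purely for illustration.
corpus = ["复方酮康唑软膏", "复方酮康唑乳膏", "复方甘草片"]
entries = [("复方酮", "复方酮康唑"), ("复方", "复方酮康唑")]
target = build_target_dictionary(
    entries, corpus, text_threshold=1, ratio_ok=lambda r: r <= 1.2
)
```

With this illustrative condition, the subsequence "复方" is dropped because it also occurs in text unrelated to the full name, while "复方酮" is kept.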
Optionally, the acquiring a text sequence of a drug object to be searched includes: acquiring description information and/or labeling information of the medicine object to be searched; and extracting the names of the general medicines from the description information and/or the labeling information to be used as the medicine text sequence.
Optionally, the method further includes: carrying out directional information acquisition on a designated professional website related to medicines to obtain information acquisition data related to the medicines; and filtering and screening the information acquisition data to obtain the medicine corpus data.
Optionally, the acquiring directional information of the specified professional website related to the medicine to obtain information acquisition data related to the medicine includes: acquiring search result data aiming at the keywords relevant to the medicines from a first link address corresponding to the specified professional website by using the keywords relevant to the medicines as search contents; analyzing a second link address from the search result data, and acquiring page content corresponding to the second link address; and analyzing text resources related to the medicine from the page content corresponding to the second link address to serve as the information acquisition data related to the medicine.
Optionally, the acquiring directional information of the specified professional website related to the medicine to obtain information acquisition data related to the medicine includes: acquiring home page navigation information of the specified professional website; acquiring page contents of columns and/or classification plates related to medicines according to the home page navigation information; and analyzing text resources related to the medicines from the page content to serve as the information acquisition data related to the medicines.
The embodiment of the present application further provides a drug search method, including: acquiring search information of a user; using the search information to query a preset medicine dictionary to obtain a medicine text sequence corresponding to the search information; acquiring information of at least one medicine object corresponding to the medicine text sequence as a medicine search result aiming at the search information; the preset medicine dictionary is the target medicine dictionary provided by the method.
Optionally, the method further includes: establishing an inverted index table of the drug objects in advance by taking the drug text sequence as an index; the inverted index table is used for inquiring the information of the medicine objects according to the medicine text sequence; the acquiring information of at least one drug object corresponding to the drug text sequence includes: taking the medicine text sequence as an index, and inquiring the inverted index table; and obtaining information of at least one medicine object according to the query result.
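The lookup path described above (search information, then medicine dictionary, then inverted index) might be sketched as follows. The record fields `id` and `generic_name`, the function names, and the in-memory dictionaries are hypothetical; a production system would presumably use a search engine or database for the index.

```python
from collections import defaultdict


def build_inverted_index(drug_objects):
    """Index drug objects by their drug text sequence (here: the generic name)."""
    index = defaultdict(list)
    for obj in drug_objects:
        index[obj["generic_name"]].append(obj["id"])
    return index


def search(query, target_dictionary, inverted_index):
    """Resolve a (possibly partial) query to full names, then to drug objects."""
    names = list(target_dictionary.get(query, []))
    results = []
    for name in names:
        results.extend(inverted_index.get(name, []))
    return names, results


# Illustrative data only; the ids and names are assumptions.
drug_objects = [
    {"id": 1, "generic_name": "复方酮康唑"},
    {"id": 2, "generic_name": "复方酮康唑"},
    {"id": 3, "generic_name": "酮康唑"},
]
target_dictionary = {"复方酮": ["复方酮康唑"]}
inverted_index = build_inverted_index(drug_objects)
names, results = search("复方酮", target_dictionary, inverted_index)
```

The returned names could also serve as prompt words, while the returned ids identify the drug objects whose information is shown as the search result.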
Optionally, the drug search result is a link address of the drug object; the method further comprises the following steps: outputting the link address of the drug object; and receiving trigger information aiming at the link address, and outputting a detail page of the medicine object according to the trigger information.
Optionally, the method further includes: and taking the medicine text sequence corresponding to the search information as a prompt word, and outputting the prompt word.
An embodiment of the present application further provides a device for generating a medicine dictionary, including: a text sequence and subsequence acquisition unit, used for acquiring a medicine text sequence of a medicine object to be searched and generating a subsequence of the medicine text sequence; a candidate dictionary generating unit, used for generating a candidate medicine dictionary based on the medicine text sequence and the subsequence of the medicine text sequence; and a target dictionary generating unit, used for generating a target medicine dictionary according to the candidate medicine dictionary and the medicine corpus data collected in advance; the target drug dictionary is used for searching target drug objects from drug objects to be searched.
An embodiment of the present application further provides a medicine search device, including: a search input unit for acquiring search information of a user; the search text determining unit is used for inquiring a preset medicine dictionary by using the search information to obtain a medicine text sequence corresponding to the search information; a search result unit, configured to acquire information of at least one drug object corresponding to the drug text sequence as a drug search result for the search information; the preset medicine dictionary is a target medicine dictionary provided by the method.
An embodiment of the present application further provides an object search method, including: acquiring an object text sequence of an object associated with a specific category, and generating a subsequence of the object text sequence; generating a candidate dictionary based on the object text sequence and the subsequence of the object text sequence; generating a target dictionary according to the candidate dictionary and the corpus data of the specific category collected aiming at the specific category; the target dictionary is used to search for target objects from the objects associated with the particular category.
Optionally, the generating a subsequence of the object text sequence includes: determining a step size for generating the subsequence; taking the step size as the length of a subsequence; for an element in the object text sequence, determining one or more next elements which are right adjacent to the element according to the length; and taking the text sequence formed by the elements and the subsequent elements as a subsequence of the object text sequence.
Optionally, the generating a target dictionary according to the candidate dictionary and the corpus data of the specific category collected for the specific category includes: and generating a target dictionary according to the occurrence frequency and/or the occurrence times of the entry data in the candidate dictionary in the specific category corpus data.
Optionally, the method further includes: determining a professional website matched with the specific category and acquiring directional information to obtain original corpus data related to the specific category; and filtering and screening the original corpus data to obtain the specific category corpus data.
Optionally, the method further includes: acquiring search content input by a user; querying the target dictionary by using the search content to obtain one or more object text sequences matched with the search content; and taking the object text sequence as a search prompt word and outputting the search prompt word, or acquiring data of at least one object corresponding to the object text sequence and outputting that data as search result data for the search content.
An embodiment of the present application further provides an electronic device, including: a memory, and a processor; the memory is used for storing a computer program, and the computer program is executed by the processor to execute the method provided by the embodiment of the application.
The embodiment of the present application further provides a storage device, in which a computer program is stored, and the computer program is executed by the processor to perform the method provided in the embodiment of the present application.
Compared with the prior art, the method has the following advantages:
According to the method, the device and the equipment for generating the medicine dictionary, a candidate medicine dictionary is generated based on the medicine text sequence of the medicine object to be searched and the subsequences of the medicine text sequence, and a target medicine dictionary is generated according to the candidate medicine dictionary and pre-collected medicine corpus data. Because the medicine text sequence and its subsequences are mined directly from the data of the medicine objects to be searched, and the pre-collected medicine corpus data can cover the medicine objects to a large extent, the target medicine dictionary generated on this basis has high precision. Furthermore, the medicine corpus data is information collected in a targeted manner from specified professional websites, so the target medicine dictionary does not depend on user search volume, and its data can cover the search range to a greater extent. Used for medicine search, the method addresses the problem of low search accuracy while ensuring search performance.
According to the medicine search method, device and equipment, a medicine text sequence corresponding to the user's search information is determined according to that search information and a preset medicine dictionary, and information of a medicine object corresponding to the medicine text sequence is acquired as the medicine search result for the search information. The preset medicine dictionary is generated as follows: a candidate medicine dictionary is generated based on a medicine text sequence of a medicine object to be searched and the subsequences of that medicine text sequence, and the target medicine dictionary is generated according to the candidate medicine dictionary and pre-collected medicine corpus data. Because the medicine text sequence and its subsequences are mined directly from the data of the medicine objects to be searched, and the pre-collected medicine corpus data can cover the medicine objects to a large extent, the target medicine dictionary generated on this basis has high precision. Furthermore, the medicine corpus data is information collected in a targeted manner from specified professional websites, so the target medicine dictionary does not depend on user search volume, and its data can cover the search range to a greater extent. Used for medicine search, the method addresses the problem of low search accuracy while ensuring search performance.
The embodiment of the application also provides an object search method and device, in which an object text sequence of an object associated with a specific category is acquired and subsequences of the object text sequence are generated; a candidate dictionary is generated based on the object text sequence and its subsequences; and a target dictionary is generated according to the candidate dictionary and the specific-category corpus data collected for the specific category; the target dictionary is used to search for target objects among the objects associated with the specific category. Because the object text sequence and its subsequences are mined directly from the data of the objects associated with the specific category, and the specific-category corpus data can cover those objects to a large extent, the target dictionary generated on this basis has high precision. Furthermore, the specific-category corpus data is information collected in a targeted manner from professional websites, which better guarantees the quality and coverage of the corpus data, so the target dictionary can cover the data of the search range corresponding to the specific category to a greater extent and does not depend on user search volume. Used for object search, the method addresses the problem of low search accuracy while ensuring search performance.
Drawings
FIG. 1 is a schematic diagram of an application environment provided by the present application;
FIG. 2 is a process flow chart of a method for generating a medicine dictionary according to a first embodiment of the present application;
FIG. 3 is a flowchart of a method for constructing a dictionary of medicines according to a first embodiment of the present application;
FIG. 4 is a process flow diagram of a drug search method according to a second embodiment of the present application;
FIG. 5 is a schematic diagram of a medicine dictionary generating device according to a third embodiment of the present application;
FIG. 6 is a schematic view of a medicine searching device according to a fourth embodiment of the present application;
FIG. 7 is a process flow diagram of an object searching method according to a fifth embodiment of the present application;
FIG. 8 is a schematic diagram of an electronic device provided herein.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described here, and those skilled in the art can make similar extensions without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The embodiment of the application provides a method and a device for generating a medicine dictionary, electronic equipment and storage equipment. The embodiment of the application also provides a medicine searching method and device, electronic equipment and storage equipment. The following examples are individually illustrated.
For ease of understanding, an application environment of the method provided in the embodiments of the present application is described first. FIG. 1 shows a user terminal 101 used by a user, a platform 102, a preset drug dictionary 103, and a drug object information base 104. The user terminal 101 is a device on which a client is installed, through which the user accesses the platform 102 to perform medicine-related searches. For example, the client includes a medicine channel, and the device is a mobile phone or other electronic device on which the client is installed; the user may enter medicine-related search content in the search box of the client's medicine channel, and the platform 102 provides medicine search results for that content. The platform 102 is a platform that provides users with medicine object information queries and/or medicine object transaction links, and the client can access it. The preset medicine dictionary 103 is a medicine dictionary generated according to the method provided in the embodiments of the present application and is used to determine the medicine name corresponding to the search content entered by the user. When the user's search content is a partial medicine name, the complete medicine name can be obtained, so that the medicine object the user is looking for can be recalled correctly. The drug object information base 104 stores information about drug objects that can be queried through the platform and/or information about tradable drug objects provided to users through the platform's medical transaction link. A drug object may be any of medical services, medical supplies, and medicines.
Specifically, the platform receives search content sent by a client and input by a user in a medicine channel search box of the client; the platform inquires the medicine name corresponding to the searched content according to a preset medicine dictionary; and the platform queries a medicine object information base according to the medicine name, returns the queried information of the medicine object to the client as a medicine search result aiming at the search content, and displays the medicine object at the client. Further, the platform can also return the inquired medicine name as a search prompt word to the client and display the search prompt word in a pull-down area of the medicine channel search box.
The method for generating the medicine dictionary provided by the embodiment of the application generates a candidate medicine dictionary based on the medicine text sequences of the medicine objects to be searched and their subsequences, and generates a target medicine dictionary based on the candidate medicine dictionary and pre-collected medicine corpus data. The dictionary can therefore cover the full data of the medicine objects to be searched to a large extent without depending on user search volume, so medicine objects can be recalled correctly when a user searches with only part of a medicine object's name, while search performance is preserved.
A medicine dictionary creating method according to a first embodiment of the present application will be described below with reference to fig. 2 and 3. The method for generating a medicine dictionary shown in fig. 2 includes: step S201 to step S203.
Step S201, a medicine text sequence of a medicine object to be searched is obtained, and a subsequence of the medicine text sequence is generated.
The medicine objects to be searched are the medicine objects contained in the search range. A medicine object may be any of medical services, medical supplies and medicines. For example, the drug objects to be searched may be drug objects that can be queried through the platform and/or tradable drug objects provided to users through the platform's medical transaction link, and more specifically may be the full set of medical goods on the platform. A drug object is a specific object that is used to treat diseases and must be managed according to legal standards, so drug objects can be characterized by their legal names; for example, the "generic drug name" is used as the common name of a drug. In practice, however, a drug object may also be a commodity object and have a drug trade name. One generic drug name can correspond to several drug trade names. For example, acetaminophen compound formulations have the trade names baijunsu, tylosin, and bitridon.
In this embodiment, the medicine text sequence is text data that represents a medicine object; one form of the medicine text sequence is the generic drug name. The same drug text sequence may correspond to several drug trade names, which distinguish different pharmaceutical products. Using the generic drug names of the medicine objects to be searched as the data basis for constructing the medicine dictionary covers the full set of medicine commodities in the preset search range, which is formed by the objects to be searched. One or more similar drug objects matching the search content can thus be found among all the objects to be searched, which improves the efficiency and accuracy of recalling drug objects; here, the search content is a medicine search word containing part or all of a generic drug name. Of course, the drug trade name may also be used as the drug text sequence of a drug object. Specifically, the text sequence of the medicine object to be searched can be obtained as follows: acquiring description information and/or labeling information of the medicine object to be searched; and extracting the generic drug name from the description information and/or the labeling information as the medicine text sequence. In practice, a drug specification generally includes a field that stores the generic drug name of each drug object, and generic drug names can be mined from the platform's medical commodity data and manually labeled. The platform's medical commodity data is at the billion level in volume, whereas the mined generic drug name data is at the ten-thousand level, so the data volume is greatly reduced and the medicine dictionary can be generated and updated more efficiently.
In this embodiment, a candidate drug dictionary is generated from the drug text sequence and its subsequences, and in subsequent steps the entry data used to generate the target drug dictionary is screened from the candidate drug dictionary. The candidate medicine dictionary can also be used directly to look up the medicine text sequence corresponding to an input subsequence. Specifically, a medicine search word containing part or all of a generic drug name can be extracted from the search content entered by a user; the search word is used as a subsequence to query the candidate medicine dictionary, and the generic drug name corresponding to the search word is obtained from the matching entry data, so that the information of the medicine object corresponding to the complete generic drug name the user expects can be predicted. In practice, the candidate medicine dictionary can be an n-gram dictionary constructed according to an n-gram language model. The n-gram dictionary is a set of entry data corresponding to strings of length n, where n is the length of a subsequence in the candidate drug dictionary. Generating the subsequences of the medicine text sequence specifically includes: generating a subsequence of the drug text sequence based on an element in the drug text sequence and the subsequent elements of that element. In implementation, the method further includes: determining a step size for generating the subsequences; taking the step size as the length of the subsequence; and, for each element in the medicine text sequence, determining according to that length the one or more elements immediately to its right as the subsequent elements of the element.
One specific implementation is as follows. Let a generic drug name string be W1W2…Wm, where Wi (1 <= i <= m) is one character or one Chinese character. A subsequence (n-gram) is a substring of length n, and the step length n ranges from 2 to m-1. When n is 2, the subsequences are W1W2, W2W3, …, Wm-1Wm. The step lengths are traversed, and for each step length a sliding window of that size extracts substrings from the generic drug name, so that all subsequences are obtained and used to construct the candidate drug dictionary. Taking the generic drug name "compound ketoconazole" (复方酮康唑, five characters in Chinese) as an example, the extracted subsequences are: with step length 2: 复方, 方酮, 酮康, 康唑; with step length 3: 复方酮, 方酮康, 酮康唑; with step length 4: 复方酮康, 方酮康唑.
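The sliding-window extraction can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name is hypothetical, and the example assumes the Chinese form of "compound ketoconazole" is the five-character string 复方酮康唑.

```python
def generate_subsequences(name: str) -> dict[int, list[str]]:
    """Return {step length n: substrings of length n} for n from 2 to len(name) - 1."""
    m = len(name)
    result = {}
    for n in range(2, m):  # step lengths 2 .. m-1
        # Slide a window of size n across the name, one character at a time.
        result[n] = [name[i:i + n] for i in range(m - n + 1)]
    return result


subs = generate_subsequences("复方酮康唑")
# subs[2] -> ['复方', '方酮', '酮康', '康唑']
# subs[3] -> ['复方酮', '方酮康', '酮康唑']
# subs[4] -> ['复方酮康', '方酮康唑']
```

A name of length m yields m - n + 1 subsequences for each step length n, so the total number of subsequences grows quadratically with the name length, which stays small for typical drug names.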
Step S202, generating a candidate medicine dictionary based on the medicine text sequence and the subsequence of the medicine text sequence.
In this embodiment, each record stored in the candidate drug dictionary is entry data comprising a subsequence and the drug text sequence corresponding to that subsequence. The subsequence is a substring extracted from the generic drug name string at a given step length in the step above. Generating the candidate dictionary specifically includes: establishing a correspondence between the medicine text sequence and each of its subsequences; generating entry data from each subsequence and its corresponding medicine text sequence; and generating the candidate medicine dictionary from the entry data. The candidate drug dictionary may be updated periodically to ensure that it records the full set of generic drug names corresponding to the objects to be searched. Because the candidate medicine dictionary is generated from the medicine text sequences and their subsequences, it does not require the context of users' search content; entry data is mined directly from the data of the objects to be searched, so the dictionary does not depend on search volume.
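One way to picture the candidate dictionary is as a mapping from each subsequence to the drug text sequences that contain it. The sketch below is an assumption about the data layout (the text only says that each entry pairs a subsequence with its drug text sequence), and the function name and sample names are illustrative.

```python
from collections import defaultdict


def build_candidate_dictionary(names):
    """Map every subsequence (length 2 .. m-1) to the full names containing it."""
    candidate = defaultdict(set)
    for name in names:
        m = len(name)
        for n in range(2, m):               # step lengths 2 .. m-1
            for i in range(m - n + 1):      # sliding window start positions
                candidate[name[i:i + n]].add(name)
    return candidate


# Two illustrative generic names; one is a substring of the other.
candidate = build_candidate_dictionary(["复方酮康唑", "酮康唑"])
# candidate['酮康']   -> {'复方酮康唑', '酮康唑'}
# candidate['复方酮'] -> {'复方酮康唑'}
```

Storing a set per subsequence keeps the one-to-many relation explicit: a short subsequence may point to several drug names, which is why the later frequency-based screening is needed.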
Step S203, generating a target medicine dictionary according to the candidate medicine dictionary and the medicine corpus data collected in advance; the target drug dictionary is used for searching target drug objects from drug objects to be searched.
The corpus refers to a set of text resources of a certain quantity and size. The medicine corpus data refers to a set of constructed text data related to medicines. In implementation, the original drug corpus data may be acquired from internet public resources, and the data obtained by preprocessing the original drug corpus data is used as the pre-collected drug corpus data, for example, the original drug corpus data is subjected to data preprocessing such as deduplication and cleaning.
In this embodiment, targeted information collection is performed on designated medicine-related professional websites, and the medicine corpus data is obtained by preprocessing the collected data. Specifically: performing targeted information collection on a designated professional website related to medicines to obtain information collection data related to medicines; and filtering and screening the information collection data to obtain the pre-collected medicine corpus data. The designated professional website includes at least one of a medical professional website and a medicine-related encyclopedia website. The resources published by medical professional websites and medicine-related encyclopedia websites are generally high-quality, valid medicine information, which helps ensure the reliability of the data. In one embodiment, performing targeted information collection on the designated professional website related to medicines to obtain information collection data related to medicines specifically includes: using a medicine-related keyword as search content, acquiring search result data for the keyword from a first link address corresponding to the designated professional website; parsing a second link address from the search result data, and acquiring the page content corresponding to the second link address; and parsing medicine-related text resources from the page content corresponding to the second link address as the information collection data related to medicines.
In another embodiment, performing targeted information collection on the designated professional website related to medicines to obtain information collection data related to medicines includes: acquiring home page navigation information of the designated professional website; acquiring, according to the home page navigation information, the page content of medicine-related columns and/or category sections; and parsing medicine-related text resources from the page content as the information collection data related to medicines. The two embodiments described above may be implemented independently or in combination. For example, to collect medicine data from an encyclopedia website, the website can be searched using a medicine name as the search term; the URL corresponding to the medicine name is parsed from the search result, and that URL is accessed to collect and parse data, yielding the original medicine corpus data. The medicine name may be a generic drug name or a drug trade name. As another example, to collect data from a professional medicine website, the home page navigation information of the website can be acquired, the pages related to diseases, symptoms, medicines, and the like can be identified from it, and the data displayed on those pages can be collected and parsed to obtain the original medicine corpus data. The pre-collected medicine corpus data obtained through this processing is high-quality external medicine data that meets certain quality requirements. The external medicine data is reliable medicine data acquired from public professional website resources outside the platform, so it covers a wide range of generic drug names. This ensures the coverage of the generated medicine dictionary, so that the information of the medicine object corresponding to the user's search content can be accurately found among the objects to be searched.
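The parsing half of the collection flow (extracting second link addresses from search result data, then extracting text from page content) might look like the sketch below. It deliberately works on in-memory HTML strings and performs no network access; the class names, function names, `example.com` addresses, and sample markup are all hypothetical and stand in for whatever the designated website actually returns.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link addresses from anchor tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative addresses against the first link address.
                self.links.append(urljoin(self.base_url, href))


class TextExtractor(HTMLParser):
    """Collect the non-empty text fragments of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def parse_second_links(search_html: str, first_link: str) -> List[str]:
    parser = LinkExtractor(first_link)
    parser.feed(search_html)
    return parser.links


def parse_text(page_html: str) -> str:
    parser = TextExtractor()
    parser.feed(page_html)
    return " ".join(parser.chunks)


# Hypothetical search result data and page content.
search_html = '<ul><li><a href="/drug/1">Ketoconazole</a></li></ul>'
links = parse_second_links(search_html, "https://example.com/search?q=ketoconazole")
page_html = "<html><body><h1>Ketoconazole</h1><p>An antifungal medicine.</p></body></html>"
text = parse_text(page_html)
print(links, text)
```

In a real collector the fetched text would still need the deduplication and cleaning mentioned earlier before it is used as corpus data.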
The external medicine data serves as the data basis for screening the entry data in the candidate medicine dictionary, while the entry data forming the candidate medicine dictionary is mined directly from the data of the objects to be searched. This avoids dependence on the context of users' search content and on search volume, yields high-quality entry data, and thus produces an accurate target medicine dictionary. Furthermore, because the external medicine data is used to filter the entry data that forms the target medicine dictionary, there is no need to accumulate search logs, and the problem of insufficient coverage of infrequently searched medicine names is avoided.
In this embodiment, the target medicine dictionary is generated according to the occurrence frequency and/or the number of occurrences, in the pre-collected medicine corpus data, of the entry data in the candidate medicine dictionary. Specifically: determining the occurrence frequency and/or the number of occurrences in the medicine corpus data of the subsequence in the entry data of the candidate medicine dictionary as a first word frequency; determining the occurrence frequency and/or the number of occurrences in the medicine corpus data of the medicine text sequence corresponding to that subsequence in the entry data as a second word frequency; and if the second word frequency is greater than a preset text validity judgment threshold and the ratio of the first word frequency to the second word frequency satisfies a preset entry validity judgment condition, adding the entry data to the target medicine dictionary. The preset text validity judgment threshold is used to judge the validity of the medicine text sequence. The preset entry validity judgment condition is used to judge the validity of the entry data in the candidate medicine dictionary. The first word frequency and the second word frequency represent how frequently, or with what probability, the entry data occurs in an actual corpus environment. For example, the word frequency in the medicine corpus data of each piece of entry data in the candidate medicine dictionary is counted.
Let the word frequency in the medicine corpus data of a subsequence in the candidate medicine dictionary be A, and let the word frequency in the medicine corpus data of the generic drug name corresponding to that subsequence be B. When B/A is greater than a first set threshold and B is greater than a second set threshold, the entry data consisting of the subsequence and the corresponding generic drug name is added to the target medicine dictionary; otherwise, the entry data is filtered out. The first set threshold constitutes the preset entry validity judgment condition, and the second set threshold is the preset text validity judgment threshold. The target medicine dictionary can thus be regarded as the result of performing word frequency statistics on the entry data in the candidate medicine dictionary and filtering that entry data according to the statistics. The target medicine dictionary generated in this embodiment is therefore an n-gram dictionary constructed according to an n-gram language model, specifically a set of entry data corresponding to a series of word strings. Each word string is a subsequence of the medicine text sequence of an object to be searched; the medicine text sequence is a word string representing a medicine object, preferably the generic drug name of the medicine object. The target medicine dictionary stores entry data such that the medicine text sequence corresponding to a subsequence can be indexed by that subsequence. The same subsequence may correspond to one or more different medicine text sequences. For example, when the subsequence is part of a generic drug name, the complete generic drug name containing that part can be indexed from it.
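The filtering rule (keep an entry when B/A exceeds the first threshold and B exceeds the second) can be sketched as below. The function names, the tiny corpus, and the threshold values 0.6 and 1 are assumptions made purely for illustration; the text does not give concrete thresholds. Word frequency is approximated here by counting substring occurrences.

```python
from typing import Dict, List, Set


def count_occurrences(text: str, corpus: List[str]) -> int:
    """Count (non-overlapping) occurrences of text across all corpus documents."""
    return sum(doc.count(text) for doc in corpus)


def build_target_dictionary(
    candidate: Dict[str, Set[str]],
    corpus: List[str],
    ratio_threshold: float,   # first set threshold, applied to B/A
    name_threshold: int,      # second set threshold, applied to B
) -> Dict[str, Set[str]]:
    target: Dict[str, Set[str]] = {}
    for subsequence, names in candidate.items():
        a = count_occurrences(subsequence, corpus)       # word frequency A
        for name in names:
            b = count_occurrences(name, corpus)          # word frequency B
            if a > 0 and b > name_threshold and b / a > ratio_threshold:
                target.setdefault(subsequence, set()).add(name)
    return target


# Hypothetical corpus: the subsequence 康唑 also appears in other drug names.
corpus = ["酮康唑乳膏", "酮康唑片", "氟康唑胶囊", "伊曲康唑胶囊"]
candidate = {"酮康": {"酮康唑"}, "康唑": {"酮康唑"}}
target = build_target_dictionary(candidate, corpus, ratio_threshold=0.6, name_threshold=1)
print(target)
```

In this toy corpus 酮康 occurs only inside 酮康唑 (A = 2, B = 2, ratio 1.0) and is kept, while 康唑 occurs in several different names (A = 4, B = 2, ratio 0.5) and is filtered out, which matches the intent of retaining subsequences that reliably point to one drug name.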
In this embodiment, because the external medicine data that serves as the data basis for generating the target medicine dictionary covers generic drug names to a large extent, the target medicine dictionary does not depend on search volume and covers, to a large extent, the medicine objects within the predetermined search range. When the search content input by the user contains only part of a drug name, the search content is used as a subsequence to query the target medicine dictionary, and the complete medicine text sequence corresponding to the subsequence can be predicted. The medicine text sequence is specifically a generic drug name, with which the corresponding medicine objects can be correctly recalled while search performance is maintained. For the same subsequence, one or more medicine text sequences may be found in the target medicine dictionary, each of which is a generic drug name. The predicted generic drug names can also be displayed as prompt words in the drop-down area of the search box. Applying the target medicine dictionary to a medicine search comprises: acquiring search information of a user; determining the medicine text sequence corresponding to the search information according to the search information and the target medicine dictionary; and acquiring information of at least one medicine object corresponding to the medicine text sequence, and providing a medicine search result for the search information according to the information of the at least one medicine object. For example, the information of the medicine object is output as the medicine search result.
Referring to FIG. 3, in the process of constructing the target drug dictionary, a "candidate n-gram dictionary" is generated from the "generic drug names" using n-gram modeling, word frequency statistics are then performed using the "external medicine corpus data", and a "target n-gram dictionary" is finally obtained. The process comprises the following steps:
S301, acquiring the full set of medical product data.
S302, extracting generic drug names from the full set of medical product data. The "generic drug name" field is extracted from the drug instructions to obtain the generic drug name information. From the "full set of medical products" data, whose volume is on the order of a hundred million records, "generic drug name" data on the order of ten thousand records is obtained through mining and manual labeling.
S303, generating a candidate n-gram dictionary based on the generic drug names. Specifically: let a generic drug name be the word string W1W2…Wm, where each Wi (1 ≤ i ≤ m) is one character or one Chinese character; the entries stored in the candidate n-gram dictionary are strings of length n, for example, when n is 2, W1W2, W2W3, …, Wm-1Wm are the candidate n-grams; the entry data of the candidate n-gram dictionary is stored such that the generic drug names are indexed by the candidate n-grams.
S304, obtaining external medicine corpus data. High-quality medicine data is collected from designated professional medical websites and/or encyclopedia websites, and is cleaned and screened to obtain the medicine corpus data.
S305, performing word frequency statistics. The frequency with which each piece of data in the candidate n-gram dictionary occurs in the "external medicine data" is counted.
S306, obtaining the target n-gram dictionary. Entries in the candidate n-gram dictionary that satisfy a given word frequency condition enter the target n-gram dictionary.
For example, assuming that the frequency of the index word string in the candidate n-gram entry is A and the frequency of the corresponding general drug name is B, when B/A > a first set threshold and B > a second set threshold are satisfied, the entry data in the candidate n-gram dictionary enters the target n-gram dictionary, otherwise, the entry data is filtered. It should be noted that, in the case of no conflict, the features given in this embodiment and other embodiments of the present application may be combined with each other, and the steps S201 and S202 or similar terms do not limit the steps to be executed sequentially.
So far, the method provided by the present embodiment is explained, and the method generates a candidate drug dictionary based on a drug text sequence of a drug object to be searched and a subsequence of the drug text sequence; and generating a target medicine dictionary according to the candidate medicine dictionary and the medicine corpus data collected in advance. Because the medicine text sequence and the subsequence thereof are directly mined from the data of the medicine object to be searched, and the medicine corpus data collected in advance can cover the medicine object to a large extent, the target medicine dictionary generated on the basis has high precision. Furthermore, the medicine corpus data is information directionally collected based on a specified professional website, so that the target medicine dictionary does not depend on the search amount of a user, and the search range can be covered by data to a greater extent. The method is used for searching the medicine, solves the problem of low accuracy of medicine searching, and can ensure the searching performance.
Second embodiment: based on the above application environment and embodiment, a second embodiment of the present application provides a medicine search method. The method is described below with reference to FIG. 4. The medicine search method shown in FIG. 4 includes steps S401 to S403.
Step S401, search information of the user is acquired.
In this embodiment, the target medicine dictionary generated by the method of the first embodiment is used as the preset medicine dictionary; the preset medicine dictionary is queried according to the search information of the user, and the information of the medicine object corresponding to the search information is obtained and output as the medicine search result. The target medicine dictionary is an n-gram dictionary constructed according to an n-gram language model, specifically a set of entry data corresponding to a series of word strings. Each word string is a subsequence of the medicine text sequence of an object to be searched; the medicine text sequence is a word string representing a medicine object, preferably the generic drug name of the medicine object. The target medicine dictionary stores entry data such that the medicine text sequence corresponding to a subsequence can be indexed by that subsequence. The same subsequence may correspond to one or more different medicine text sequences. For example, when the subsequence is part of a generic drug name, the complete generic drug name containing that part can be indexed from it. The generic drug name is used to search for the information of the corresponding medicine object. A medicine object may be any object among medical services, medical supplies, and medicines.
The search information of the user may be search content input by the user in a search box of the client or in the search box of the client's medicine channel, and is used to search for information of the corresponding medicine object. In implementation, if the method is executed by a client, acquiring the search information of the user includes: in response to an input trigger in the search box of the medicine channel page, the client obtains the search term input by the user in the search box, the search term being the search information, and then performs the subsequent steps. If the method is executed by a platform, acquiring the search information of the user includes: the platform receives, from the client, the search term currently associated with the search box of the medicine channel page, the search term being the search information, and then performs the subsequent steps using the search term.
Step S402, using the search information to inquire a preset medicine dictionary to obtain a medicine text sequence corresponding to the search information.
In this embodiment, if it is determined that the search information is from a search box of a medicine channel of a client, the search information is used as an index, a preset medicine dictionary is preferentially used to search for entry data matched with the search information, and a complete medicine text sequence matched with the search information is extracted from the entry data, where the medicine text sequence is specifically a universal medicine name of a medicine object. And inquiring the information of the medicine object in a preset search range according to the universal medicine name. The preset search range is a search range formed by objects to be searched for which target medicine dictionaries are generated.
Step S403, acquiring information of at least one medicine object corresponding to the medicine text sequence as a medicine search result for the search information, wherein the preset medicine dictionary is the target medicine dictionary in the medicine dictionary generation method provided by the embodiments of the present application.
In this embodiment, the method further includes: establishing in advance an inverted index table of the medicine objects with the medicine text sequence as the index, the inverted index table being used to query the information of medicine objects according to the medicine text sequence. Acquiring the information of at least one medicine object corresponding to the medicine text sequence includes: querying the inverted index table with the medicine text sequence as the index; and obtaining the information of at least one medicine object according to the query result. An inverted index is an indexing method, commonly used together with word segmentation, in which the location of a record is determined from an attribute value. Each piece of data in an inverted index table includes an attribute value and the addresses of all records having that attribute value. Accordingly, each piece of data in the inverted index table of the medicine objects includes a generic drug name value and the addresses of the information of all medicine objects having that value. After the generic drug name is obtained in the preceding steps, the inverted index table of the medicine objects is queried to obtain the information of the medicine objects matching the generic drug name.
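A minimal sketch of such an inverted index table is given below, assuming each medicine object is a record with an identifier, a generic drug name, and a product name. The field names, function names, and sample records are hypothetical; record identifiers stand in for the record addresses mentioned in the text.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(drug_objects: List[dict]) -> Dict[str, List[int]]:
    """Map each generic drug name to the ids of all objects that carry it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for obj in drug_objects:
        index[obj["generic_name"]].append(obj["id"])
    return dict(index)


def search_objects(
    generic_names: List[str],
    index: Dict[str, List[int]],
    objects_by_id: Dict[int, dict],
) -> List[dict]:
    """Return the object records for every generic drug name that was found."""
    results: List[dict] = []
    for name in generic_names:
        for obj_id in index.get(name, []):
            results.append(objects_by_id[obj_id])
    return results


# Hypothetical medicine objects within the search range.
drug_objects = [
    {"id": 1, "generic_name": "酮康唑", "product_name": "Brand A cream"},
    {"id": 2, "generic_name": "酮康唑", "product_name": "Brand B tablets"},
    {"id": 3, "generic_name": "氟康唑", "product_name": "Brand C capsules"},
]
index = build_inverted_index(drug_objects)
objects_by_id = {obj["id"]: obj for obj in drug_objects}
hits = search_objects(["酮康唑"], index, objects_by_id)
print([hit["product_name"] for hit in hits])
```

Because the index is keyed by generic drug name, one lookup returns every product sharing that name, which is how a single predicted name can recall several medicine objects.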
In this embodiment, the information of the queried medicine objects is displayed on the client as the medicine search result for the search content. If the medicine search result is a link address of a medicine object, the method further includes: outputting the link address of the medicine object; and receiving trigger information for the link address, and outputting a detail page of the medicine object according to the trigger information. One generic drug name may correspond to one or more medicine objects with different product names; in that case, trigger information for the link address of a target medicine object among the plurality of medicine objects is received, and the detail page of the target medicine object is output.
In this embodiment, the generic drug names found according to the search information and the target medicine dictionary may also be displayed as search prompt words in the drop-down area of the client's medicine channel search box. Specifically, the medicine text sequence corresponding to the search information is used as a prompt word, and the prompt word is output. The same search information may match one or more medicine text sequences, in which case outputting the prompt word includes presenting a plurality of selectable prompt words in the drop-down area of the search box. The user may select one or more of the selectable prompt words and trigger a search for them. In one embodiment, if the one or more prompt words are generic drug names matching the search information, the information of the corresponding medicine objects is searched for using the one or more generic drug names selected by the user.
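Producing prompt words from the dictionary amounts to a lookup keyed by the user's partial input. The sketch below assumes the target dictionary is a mapping from subsequence to a set of generic drug names; the function name `suggest_prompt_words`, the `limit` parameter, the alphabetical ordering, and the sample dictionary are illustrative choices not specified in the text.

```python
from typing import Dict, List, Set


def suggest_prompt_words(
    search_info: str,
    target_dictionary: Dict[str, Set[str]],
    limit: int = 5,
) -> List[str]:
    """Return up to `limit` generic drug names indexed by the search information."""
    query = search_info.strip()                  # ignore surrounding whitespace
    names = target_dictionary.get(query, set())  # subsequence -> drug names
    return sorted(names)[:limit]                 # stable order for display


# Hypothetical target dictionary with one ambiguous subsequence.
target_dictionary = {
    "康唑": {"酮康唑", "氟康唑", "伊曲康唑"},
    "酮康": {"酮康唑"},
}
prompts = suggest_prompt_words(" 康唑 ", target_dictionary)
print(prompts)
```

A production system would more likely rank the prompt words by popularity or relevance; sorting is used here only to keep the output deterministic.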
The medicine searching method provided by the embodiment does not depend on the searching amount, and can cover the whole amount of medical products in a preset searching range. When the search information of the user is part of the drug names, the complete universal drug names can be predicted through the target drug dictionary, the drug objects can be recalled correctly, and meanwhile, the search performance is guaranteed.
This concludes the description of the method provided by this embodiment. The method determines the medicine text sequence corresponding to the search information according to the search information of the user and a preset medicine dictionary, and acquires the information of the medicine object corresponding to the medicine text sequence as the medicine search result for the search information. The preset medicine dictionary is generated as follows: a candidate medicine dictionary is generated based on the medicine text sequence of a medicine object to be searched and the subsequences of the medicine text sequence, and the preset medicine dictionary is then generated according to the candidate medicine dictionary and pre-collected medicine corpus data. Because the medicine text sequence and its subsequences are mined directly from the data of the medicine objects to be searched, and the pre-collected medicine corpus data covers the medicine objects to a large extent, the target medicine dictionary generated on this basis has high precision. Furthermore, because the medicine corpus data is information collected in a targeted manner from designated professional websites, the target medicine dictionary does not depend on users' search volume and covers the search range to a greater extent. Using the method for medicine search solves the problem of low medicine search accuracy while maintaining search performance.
Third embodiment: corresponding to the first embodiment, a third embodiment of the present application provides a medicine dictionary generation device. The device is described below with reference to FIG. 5. The medicine dictionary generation device shown in FIG. 5 includes:
a text sequence and subsequence acquiring unit 501, configured to acquire a drug text sequence of a drug object to be searched, and generate a subsequence of the drug text sequence;
a candidate dictionary generating unit 502 for generating a candidate medicine dictionary based on the medicine text sequence and the subsequence of the medicine text sequence;
a target dictionary generating unit 503, configured to generate a target medicine dictionary according to the candidate medicine dictionary and pre-collected medicine corpus data; the target drug dictionary is used for searching target drug objects from drug objects to be searched.
Optionally, the text sequence and subsequence obtaining unit 501 is specifically configured to: generating a subsequence of the drug text sequence based on elements in the drug text sequence and subsequent elements of the elements.
Optionally, the text sequence and subsequence obtaining unit 501 is specifically configured to: determining a step size for generating the sub-sequence; taking the step size as the length of the subsequence; and aiming at the elements in the medicine text sequence, determining one or more elements which are adjacent to the right of the elements according to the length as the subsequent elements of the elements.
Optionally, the candidate dictionary generating unit 502 is specifically configured to: establishing a corresponding relation between the medicine text sequence and each subsequence of the medicine text sequence; and generating entry data according to each subsequence and the medicine text sequence corresponding to the subsequence, and generating the candidate medicine dictionary by using the entry data.
Optionally, the target dictionary generating unit 503 is specifically configured to: and generating a target medicine dictionary according to the occurrence frequency and/or the occurrence times of the entry data in the candidate medicine dictionary in the pre-collected medicine corpus data.
Optionally, the target dictionary generating unit 503 is specifically configured to: determining the occurrence frequency and/or the number of occurrences in the medicine corpus data of the subsequence in the entry data of the candidate medicine dictionary as a first word frequency; determining the occurrence frequency and/or the number of occurrences in the medicine corpus data of the medicine text sequence corresponding to that subsequence in the entry data as a second word frequency; and if the second word frequency is greater than a preset text validity judgment threshold and the ratio of the first word frequency to the second word frequency satisfies a preset entry validity judgment condition, adding the entry data to the target medicine dictionary.
Optionally, the text sequence and subsequence obtaining unit 501 is specifically configured to: acquiring description information and/or labeling information of the medicine object to be searched; and extracting the names of the general medicines from the description information and/or the labeling information to be used as the medicine text sequence.
Optionally, the apparatus further includes a medical information collecting unit, and the medical information collecting unit is configured to: carrying out directional information acquisition on a designated professional website related to medicines to obtain information acquisition data related to the medicines; and filtering and screening the information acquisition data to obtain the medicine corpus data.
Optionally, the medical information collecting unit is specifically configured to: acquiring search result data for the keywords related to the medicines from a first link address corresponding to the specified professional website by using the keywords related to the medicines as search contents; analyzing a second link address from the search result data, and acquiring page content corresponding to the second link address; and analyzing text resources related to the medicine from the page content corresponding to the second link address to serve as the information acquisition data related to the medicine.
Optionally, the medical information collecting unit is specifically configured to: acquiring home page navigation information of the specified professional website; acquiring page contents of columns and/or classification plates related to medicines according to the home page navigation information; and analyzing text resources related to the medicines from the page contents to serve as the information acquisition data related to the medicines.
Fourth embodiment: corresponding to the second embodiment, a fourth embodiment of the present application provides a medicine search device. The device is described below with reference to FIG. 6. The medicine search device shown in FIG. 6 includes:
a search input unit 601 for acquiring search information of a user;
a search text determining unit 602, configured to query a preset drug dictionary using the search information to obtain a drug text sequence corresponding to the search information;
a search result unit 603, configured to acquire information of at least one drug object corresponding to the drug text sequence as a drug search result for the search information;
wherein the preset medicine dictionary is the target medicine dictionary generated by the medicine dictionary generation method described above.
Optionally, the search result unit 603 is specifically configured to: establishing an inverted index table of the medicine objects in advance by taking the medicine text sequence as an index; the inverted index table is used for inquiring the information of the medicine objects according to the medicine text sequence; taking the medicine text sequence as an index, and inquiring the inverted index table; and obtaining information of at least one medicine object according to the query result.
Optionally, the drug search result is a link address of the drug object; the search result unit 603 is specifically configured to: outputting the link address of the drug object; and receiving trigger information aiming at the link address, and outputting a detail page of the medicine object according to the trigger information.
Optionally, the search result unit 603 is specifically configured to: and taking the medicine text sequence corresponding to the search information as a prompt word, and outputting the prompt word.
Fifth embodiment: based on the above embodiments, a fifth embodiment of the present application provides an object search method. In the object search method provided by this embodiment, a candidate dictionary is generated based on data of the objects included in a specific category, and a target dictionary is generated according to the candidate dictionary and corpus data collected for the specific category; the target dictionary can be used to search for a target object among the objects included in the specific category. When the candidate dictionary is generated using the object texts of the full set of objects in a specific category, the entries in the candidate dictionary can cover the word segments of the full set of object texts. High-quality corpus data collected for the specific category is then used to screen the entries in the candidate dictionary, so that a more accurate target dictionary with a certain degree of coverage of the objects in the specific category can be obtained.
The method is described below with reference to FIG. 7. The object search method shown in FIG. 7 includes steps S701 to S703.
Step S701, an object text sequence of an object associated with a specific category is obtained, and a subsequence of the object text sequence is generated.
The specific category may be a category corresponding to a professional field or an object category that divides objects according to a predetermined standard. Each particular category may have associated therewith a plurality of objects. The objects are objects of the platform or different stores of the platform. For example, a particular category is a certain category of goods of the platform such as the food category, the medicine category, and the like. As another example, the object is a commodity that the platform can query and/or a commodity provided to the user through a transaction link of the platform, such as a certain dish, a certain medicine, etc. The objects associated with a particular category may be full-size objects under the particular category.
The object text sequence refers to text data for representing an object, and one example is an object name. The full amount of objects within a predetermined search range may be overlaid with the object names of the objects associated with the particular category as a basis for constructing the target dictionary, the objects associated with the particular category constituting the predetermined search range. Therefore, one or more similar objects matched with the searched content in the preset search range can be searched, and the efficiency and the accuracy of object recall are improved; the search content is a search word containing a part of or the whole object name. Specifically, the object text sequence may be obtained by: acquiring description information and/or labeling information of the object associated with the specific category; and extracting an object name from the description information and/or the labeling information to be used as the object text sequence. In fact, the data volume scale of the object data of the specific class of the platform is large, and the data volume scale level of the object name obtained by mining can be obviously reduced, so that the data volume level is reduced, and the generation and updating efficiency of the target dictionary can be improved.
In this embodiment, a candidate dictionary needs to be generated according to the object text sequence and its subsequences; the candidate dictionary is used in the subsequent steps for screening out the entry data from which the target dictionary is generated. The candidate dictionary can also be used to look up the object text sequence corresponding to an input subsequence. Specifically, a search term containing part or all of an object name can be extracted from the search content input by the user, and the candidate dictionary is queried with the search term as a subsequence to obtain the object name corresponding to the search term, so that the information of the object corresponding to the complete object name the user expects to query can be predicted. In practical application, the candidate dictionary may be an n-gram dictionary constructed according to an n-gram language model; specifically, the candidate dictionary may be a set of entry data corresponding to a series of word strings, where each word string is represented by a subsequence of the object text sequence, and the object text sequence is preferably an object name. The candidate dictionary stores entry data in such a way that the object text sequence corresponding to a subsequence can be indexed by that subsequence. The same subsequence may correspond to one or more different object text sequences. For example, a subsequence is a portion of the text of an object name, from which the complete object name containing that portion can be indexed. The object name is used for searching information of the corresponding object.
The subsequence of the object text sequence may be generated by: determining a step size for generating the subsequence; taking the step size as the length of the subsequence; for an element in the object text sequence, determining, according to the length, one or more subsequent elements that are right-adjacent to the element; and taking the text sequence formed by the element and its subsequent elements as a subsequence of the object text sequence. A specific implementation is as follows: let the name string of a generic medicine be W1W2…Wm, where each Wi (1 <= i <= m) is one character or one Chinese character; an n-gram subsequence is a string of length n; the step size n ranges from 2 to m-1; when n is 2, the subsequences include W1W2, W2W3, and so on up to Wm-1Wm. The step sizes are traversed, and substrings are extracted from the object name using a sliding window whose size equals each step size, so that all subsequences are obtained and used to construct the candidate dictionary.
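The sliding-window extraction described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation; the function name `generate_subsequences` is hypothetical, and it assumes that each Python character corresponds to one element Wi of the name string.

```python
def generate_subsequences(name: str) -> list[str]:
    """Return every contiguous substring of `name` with length 2 .. len(name) - 1.

    Hypothetical helper: one Python character is treated as one element Wi.
    Substrings are emitted shortest window first, left to right.
    """
    m = len(name)
    subsequences = []
    for n in range(2, m):                  # window sizes n = 2 .. m-1
        for start in range(m - n + 1):     # slide the window one element at a time
            subsequences.append(name[start:start + n])
    return subsequences


if __name__ == "__main__":
    # A four-character name yields three 2-grams and two 3-grams.
    print(generate_subsequences("阿莫西林"))
```

A name with repeated characters can yield the same substring more than once; whether duplicates are dropped at this stage or when the dictionary is built is a design choice the text leaves open.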
Step S702, generating a candidate dictionary based on the object text sequence and the subsequence of the object text sequence.
In this embodiment, each piece of data stored in the candidate dictionary is entry data including a subsequence and the object text sequence corresponding to that subsequence. Specifically, generating the candidate dictionary includes: establishing a correspondence between the object text sequence and each subsequence of the object text sequence; and generating entry data from each subsequence and the object text sequence corresponding to it, the entry data forming the candidate dictionary. The candidate dictionary may be updated regularly to ensure that the object text sequences of the full set of objects of the specific category are recorded. Because the entry data is mined directly from the search range rather than from the context of user search content, the method does not depend on search volume.
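As an illustration of the entry data described above, the sketch below maps each subsequence to the set of object names that contain it. The names `ngrams` and `build_candidate_dictionary`, the use of a `dict` of sets, and the sample drug names are all assumptions made for this example; the text does not prescribe a storage structure.

```python
from collections import defaultdict


def ngrams(name: str) -> list[str]:
    """Contiguous substrings of length 2 .. len(name) - 1 (hypothetical helper)."""
    m = len(name)
    return [name[i:i + n] for n in range(2, m) for i in range(m - n + 1)]


def build_candidate_dictionary(object_names: list[str]) -> dict[str, set[str]]:
    """Index every object name by each of its subsequences.

    Each (subsequence, object name) pair plays the role of one piece of
    entry data; one subsequence may point to several object names.
    """
    candidate: dict[str, set[str]] = defaultdict(set)
    for name in object_names:
        for sub in ngrams(name):
            candidate[sub].add(name)
    return dict(candidate)


if __name__ == "__main__":
    candidate = build_candidate_dictionary(["ibuprofen", "ibuprofen gel"])
    print(sorted(candidate["ibu"]))   # both names contain "ibu"
    print(sorted(candidate["gel"]))   # only the gel contains "gel"
```

Storing the names in a set means a substring that occurs twice within one name produces a single entry for that name.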
Step S703, generating a target dictionary according to the candidate dictionary and the corpus data of the specific category collected aiming at the specific category; the target dictionary is used to search for target objects from the objects associated with the particular category.
The specific category corpus data refers to a constructed set of text data related to the specific category. In implementation, original corpus data may be obtained from public Internet resources, and the data obtained after preprocessing the original corpus data is used as the specific category corpus data. Specifically: determining a professional website matching the specific category and performing targeted information collection on it to obtain original corpus data related to the specific category; and filtering and screening the original corpus data to obtain the specific category corpus data. The professional websites include at least one of professional websites of the specific category that meet evaluation criteria and related encyclopedia websites, which ensures the reliability of the corpus data. In practice, the original corpus data may be collected by the following process: using a keyword related to the specific category as search content, acquiring search result data for that keyword from a first link address corresponding to the professional website; parsing a second link address from the search result data, and acquiring the page content corresponding to the second link address; and parsing text resources related to the specific category from the page content corresponding to the second link address to serve as the original corpus data. The original corpus data may also be collected by the following process: acquiring the home page navigation information of the professional website; acquiring the page content of columns and/or classification sections related to the specific category according to the home page navigation information; and parsing text resources related to the specific category from the page content to serve as the original corpus data. The two embodiments described above may be implemented independently or in combination.
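The step of parsing second link addresses out of search result data might look like the sketch below, which uses only the Python standard library. It assumes the search result data is HTML and that result links appear as `href` attributes of `<a>` tags; the class and function names and the `example.org` address are hypothetical, and fetching the pages is deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors that carry no link
            self.links.append(urljoin(self.base_url, href))


def extract_second_links(search_result_html: str, first_link: str) -> list[str]:
    """Resolve every result link against the first link address (assumed layout)."""
    parser = LinkExtractor(first_link)
    parser.feed(search_result_html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    html = '<ul><li><a href="/drug/1">A</a></li><li><a href="/drug/2">B</a></li></ul>'
    print(extract_second_links(html, "https://example.org/search?q=drug"))
```

A real collector would also need the filtering and screening step that the text requires before the pages count as corpus data.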
The specific category corpus data obtained through the above processing is high-quality external corpus data that meets certain quality requirements. The external corpus data is reliable object data acquired from public professional website resources outside the platform; it covers a wide range of object names, which ensures the coverage of the generated target dictionary and ensures that the information of the object corresponding to the user's search content is correctly found among the objects associated with the specific category. The external corpus data serves as the data basis for screening the entry data in the candidate dictionary, while the entry data forming the candidate dictionary is mined directly from the search range (namely, the data of the objects associated with the specific category). Dependence on the context of search content and on the user's search volume is thereby avoided, high-quality entry data can be obtained, and an accurate dictionary can be obtained. Moreover, because the entry data forming the target dictionary is filtered using the external corpus data, there is no need to accumulate search logs, and the incomplete coverage of infrequently searched object names is avoided.
The target dictionary generated in this embodiment is an n-gram dictionary constructed according to an n-gram language model; specifically, it is a set of entry data corresponding to a series of word strings. Each word string is represented by a subsequence of the object text sequence; the object text sequence is a word string representing an object, preferably an object name, and may also be an object identifier, object key information, and the like. The target dictionary stores entry data in such a way that the object text sequence corresponding to a subsequence can be indexed by that subsequence. The same subsequence may correspond to one or more different object text sequences. For example, a subsequence is a portion of the text of an object name, from which the complete object name containing that portion can be indexed. Generating the target dictionary according to the candidate dictionary and the specific category corpus data collected for the specific category includes: generating the target dictionary according to the occurrence frequency and/or the number of occurrences, in the specific category corpus data, of the entry data in the candidate dictionary.
Specifically: determining the occurrence frequency and/or the number of occurrences, in the specific category corpus data, of the subsequence in the entry data of the candidate dictionary as a first word frequency; determining the occurrence frequency and/or the number of occurrences, in the specific category corpus data, of the object text sequence corresponding to the subsequence in the entry data as a second word frequency; and if the second word frequency is greater than a preset text validity judgment threshold and the ratio of the first word frequency to the second word frequency meets a preset entry validity judgment condition, adding the entry data to the target dictionary. The preset text validity judgment threshold is used for judging the validity of the object text sequence. The preset entry validity judgment condition is used for judging the validity of the entry data in the candidate dictionary. The first word frequency and the second word frequency represent the occurrence frequency or the occurrence probability of the entry data in the actual corpus environment. For example, let the word frequency of a subsequence of the candidate dictionary in the specific category corpus data be A (the first word frequency), and the word frequency of the object text sequence corresponding to the subsequence in the specific category corpus data be B (the second word frequency); when B/A is greater than a first set threshold and B is greater than a second set threshold, the entry data formed by the subsequence and the corresponding object text sequence is added to the target dictionary; otherwise, the entry data is filtered out. The first set threshold constitutes the preset entry validity judgment condition, and the second set threshold is the preset text validity judgment threshold.
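The filtering rule in the example above (keep an entry when B/A exceeds the first set threshold and B exceeds the second) can be sketched as follows. Counting raw substring occurrences with `str.count` is an assumption made here for simplicity, since the text allows either a frequency or a count; the function names, the toy corpus, and the threshold values are hypothetical.

```python
def count_occurrences(text: str, corpus: list[str]) -> int:
    """Number of non-overlapping occurrences of `text` across all corpus documents."""
    return sum(document.count(text) for document in corpus)


def build_target_dictionary(
    candidate: dict[str, set[str]],
    corpus: list[str],
    ratio_threshold: float,   # "first set threshold" applied to B / A
    text_threshold: int,      # "second set threshold" applied to B
) -> dict[str, set[str]]:
    """Keep an entry (subsequence, name) only if B > text_threshold and B / A > ratio_threshold."""
    target: dict[str, set[str]] = {}
    for subsequence, names in candidate.items():
        a = count_occurrences(subsequence, corpus)        # first word frequency A
        if a == 0:
            continue                                      # subsequence never seen in the corpus
        for name in names:
            b = count_occurrences(name, corpus)           # second word frequency B
            if b > text_threshold and b / a > ratio_threshold:
                target.setdefault(subsequence, set()).add(name)
    return target


if __name__ == "__main__":
    corpus = ["ibuprofen relieves pain", "ibuprofen gel is topical",
              "gel pens are not drugs", "hair gel"]
    candidate = {"ibu": {"ibuprofen"}, "gel": {"ibuprofen gel"}}
    print(build_target_dictionary(candidate, corpus, ratio_threshold=0.5, text_threshold=0))
```

In this toy corpus "gel" appears far more often on its own than as part of "ibuprofen gel", so B/A is low and the entry is filtered out, which is the kind of ambiguous subsequence the ratio test is meant to catch.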
In this embodiment, the target dictionary may be used to search for a target object matching the search content within a specific category. The target dictionary may also be used to predict prompt words matching the search content, which are presented in a drop-down area of the search box. Specifically: acquiring search content input by a user; querying the target dictionary using the search content to obtain one or more object text sequences matching the search content; and either taking the object text sequence as a search prompt word and outputting the search prompt word, or acquiring data of at least one object corresponding to the object text sequence and outputting that data as search result data for the search content.
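The two uses of the target dictionary described above, prompt words and search results, can be sketched as follows. The dictionary layout (subsequence to set of object names), the inverted index keyed by object name, and the SKU-style identifiers are assumptions for illustration; the function names `suggest` and `search` are hypothetical.

```python
def suggest(target_dictionary: dict[str, set[str]], search_content: str, limit: int = 10) -> list[str]:
    """Return up to `limit` object names indexed by the search content, sorted for stable output."""
    matches = target_dictionary.get(search_content.strip(), set())
    return sorted(matches)[:limit]


def search(
    target_dictionary: dict[str, set[str]],
    inverted_index: dict[str, list[str]],
    search_content: str,
) -> list[str]:
    """Resolve the search content to object names, then to object records via the inverted index."""
    results: list[str] = []
    for name in suggest(target_dictionary, search_content):
        results.extend(inverted_index.get(name, []))
    return results


if __name__ == "__main__":
    target = {"ibu": {"ibuprofen", "ibuprofen gel"}}
    index = {"ibuprofen": ["sku-1"], "ibuprofen gel": ["sku-2", "sku-3"]}
    print(suggest(target, "ibu"))        # prompt words for the drop-down area
    print(search(target, index, "ibu"))  # object records for the result list
```

Because the dictionary keys are proper subsequences of object names, a query that is itself a complete object name would need a separate direct lookup in the inverted index; the sketch does not cover that case.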
This completes the description of the method provided by the present embodiment. The method acquires an object text sequence of an object associated with a specific category and generates subsequences of the object text sequence; generates a candidate dictionary based on the object text sequence and its subsequences; and generates a target dictionary according to the candidate dictionary and the specific category corpus data collected for the specific category, the target dictionary being used to search for target objects among the objects associated with the specific category. The entry data of the target dictionary, generated on the basis of the object text sequence and its subsequences, is mined directly from the data of the objects associated with the specific category, and the specific category corpus data collected for the specific category can cover those objects to a large extent, so the precision of the target dictionary is high. Furthermore, the specific category corpus data is collected in a targeted manner from professional websites, which better guarantees the quality and coverage of the corpus data, so that the target dictionary can cover the data of the search range corresponding to the specific category to a greater extent and does not depend on the user's search volume. When the method is used for object search, the problem of low object search accuracy is solved and search performance can be ensured.
Based on the foregoing embodiments, a sixth embodiment of the present application provides an electronic device, and please refer to the corresponding description of the foregoing embodiments for related portions. Referring to fig. 8, the electronic device shown in fig. 8 includes a memory 801 and a processor 802. The memory stores a computer program, and the computer program is executed by the processor to execute the method provided by the embodiment of the application.
Based on the foregoing embodiments, a seventh embodiment of the present application provides a storage device, and please refer to the corresponding description of the foregoing embodiments for related portions. The schematic diagram of the storage device is similar to fig. 8. The storage device stores a computer program, and the computer program is executed by the processor to execute the method provided by the embodiment of the application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Although the present application has been described with reference to preferred embodiments, these are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the claims that follow.

Claims (20)

1. A method for generating a medicine dictionary is characterized by comprising the following steps:
acquiring a medicine text sequence of a medicine object to be searched, and generating a subsequence of the medicine text sequence;
generating a candidate drug dictionary based on the drug text sequence and the subsequence of the drug text sequence, the candidate drug dictionary being composed of entry data including a drug text sequence and a subsequence of the drug text sequence;
generating a target medicine dictionary according to the candidate medicine dictionary and the pre-collected medicine corpus data, wherein the generating step comprises the following steps: if the second word frequency is greater than a preset text validity judgment threshold value and the ratio of the first word frequency to the second word frequency meets a preset entry validity judgment condition, adding the entry data to the target medicine dictionary, wherein the first word frequency is the occurrence frequency and/or the number of occurrences of a subsequence of the entry data in the medicine corpus data, and the second word frequency is the occurrence frequency and/or the number of occurrences of a medicine text sequence corresponding to the subsequence in the medicine corpus data; the target drug dictionary is used for searching target drug objects from drug objects to be searched.
2. The method of claim 1, wherein generating the subsequence of the text sequence of drugs comprises:
generating a subsequence of the drug text sequence based on elements in the drug text sequence and subsequent elements of the elements.
3. The method of claim 2, further comprising:
determining a step size for generating the subsequence;
taking the step size as the length of the subsequence;
and aiming at the elements in the medicine text sequence, determining one or more elements which are adjacent to the right of the elements according to the length as the subsequent elements of the elements.
4. The method of claim 1, wherein generating a candidate drug dictionary based on the drug text sequence and subsequences of the drug text sequence comprises:
establishing a corresponding relation between the medicine text sequence and each subsequence of the medicine text sequence;
and generating entry data according to each subsequence and the medicine text sequence corresponding to the subsequence, and generating the candidate medicine dictionary by using the entry data.
5. The method of claim 1, wherein the obtaining a drug text sequence of a drug object to be searched comprises:
acquiring the description information and/or the labeling information of the medicine object to be searched;
and extracting the names of the general medicines from the description information and/or the labeling information to be used as the medicine text sequence.
6. The method of claim 1, further comprising:
carrying out directional information acquisition on a designated professional website related to medicines to obtain information acquisition data related to the medicines;
and filtering and screening the information acquisition data to obtain the medicine corpus data.
7. The method of claim 6, wherein the performing targeted information collection on a designated professional website related to medicine to obtain information collection data related to medicine comprises:
acquiring search result data aiming at the keywords relevant to the medicines from a first link address corresponding to the specified professional website by using the keywords relevant to the medicines as search contents;
analyzing a second link address from the search result data, and acquiring page content corresponding to the second link address;
and analyzing text resources related to the medicine from the page content corresponding to the second link address to serve as the information acquisition data related to the medicine.
8. The method of claim 6, wherein the performing targeted information collection on a specific professional website related to medicine to obtain information collection data related to medicine comprises:
acquiring home page navigation information of the specified professional website;
acquiring page contents of columns and/or classification plates related to medicines according to the home page navigation information;
and analyzing text resources related to the medicines from the page content to serve as the information acquisition data related to the medicines.
9. A method for searching for a drug, comprising:
acquiring search information of a user;
using the search information to query a preset medicine dictionary to obtain a medicine text sequence corresponding to the search information;
acquiring information of at least one medicine object corresponding to the medicine text sequence as a medicine search result aiming at the search information;
wherein the preset medicine dictionary is the target medicine dictionary of any one of claims 1 to 8.
10. The method of claim 9, further comprising:
establishing an inverted index table of the medicine objects in advance by taking the medicine text sequence as an index; the inverted index table is used for inquiring the information of the medicine objects according to the medicine text sequence;
the acquiring information of at least one drug object corresponding to the drug text sequence includes:
taking the medicine text sequence as an index, and inquiring the inverted index table;
and obtaining information of at least one medicine object according to the query result.
11. The method of claim 9, wherein the drug search result is a link address of a drug object;
the method further comprises the following steps:
outputting the link address of the drug object;
and receiving trigger information aiming at the link address, and outputting a detail page of the medicine object according to the trigger information.
12. The method of claim 9, further comprising:
and taking the medicine text sequence corresponding to the search information as a prompt word, and outputting the prompt word.
13. A medicine dictionary creating apparatus, comprising:
a text sequence and subsequence acquisition unit, configured to acquire a medicine text sequence of a medicine object to be searched and to generate a subsequence of the medicine text sequence;
a candidate dictionary generating unit configured to generate a candidate medicine dictionary based on the medicine text sequence and a subsequence of the medicine text sequence, the candidate medicine dictionary being configured of entry data including the medicine text sequence and the subsequence of the medicine text sequence;
a target dictionary generating unit, configured to generate a target medicine dictionary according to the candidate medicine dictionary and pre-collected medicine corpus data, where the target dictionary generating unit is specifically configured to: if the second word frequency is greater than a preset text validity judgment threshold value and the ratio of the first word frequency to the second word frequency meets a preset entry validity judgment condition, adding the entry data to the target medicine dictionary, wherein the first word frequency is the occurrence frequency and/or the number of occurrences of a subsequence of the entry data in the medicine corpus data, and the second word frequency is the occurrence frequency and/or the number of occurrences of a medicine text sequence corresponding to the subsequence in the medicine corpus data; the target drug dictionary is used for searching target drug objects from drug objects to be searched.
14. A medicine search device, comprising:
a search input unit for acquiring search information of a user;
the search text determining unit is used for inquiring a preset medicine dictionary by using the search information to obtain a medicine text sequence corresponding to the search information;
a search result unit, configured to acquire information of at least one drug object corresponding to the drug text sequence as a drug search result for the search information;
wherein the preset medicine dictionary is the target medicine dictionary of any one of claims 1 to 8.
15. An object search method, comprising:
acquiring an object text sequence of an object associated with a specific category, and generating a subsequence of the object text sequence;
generating a candidate dictionary based on the object text sequence and subsequences of the object text sequence, the candidate dictionary being composed of term data comprising an object text sequence and subsequences of the object text sequence;
generating a target dictionary according to the candidate dictionary and the specific category corpus data collected aiming at the specific category, wherein the generating comprises the following steps: if the second word frequency is greater than a preset text validity judgment threshold value and the ratio of the first word frequency to the second word frequency meets a preset entry validity judgment condition, adding the entry data to the target dictionary, wherein the first word frequency is the occurrence frequency and/or the number of occurrences of a subsequence of the entry data in the specific category corpus data, and the second word frequency is the occurrence frequency and/or the number of occurrences of an object text sequence corresponding to the subsequence in the specific category corpus data; the target dictionary is used to search for target objects from the objects associated with the particular category.
16. The method of claim 15, wherein generating the subsequence of the object text sequence comprises:
determining a step size for generating the sub-sequence;
taking the step size as the length of the subsequence;
for an element in the object text sequence, determining one or more next elements which are right adjacent to the element according to the length;
and taking the text sequence formed by the elements and the subsequent elements as a subsequence of the object text sequence.
17. The method of claim 15, further comprising:
determining a professional website matched with the specific category and acquiring directional information to obtain original corpus data related to the specific category;
and filtering and screening the original corpus data to obtain the specific category corpus data.
18. The method of claim 15, further comprising:
acquiring search content input by a user;
querying the target dictionary by using the search content to obtain one or more object text sequences matched with the search content;
taking the object text sequence as a search prompt word, and outputting the search prompt word; or acquiring data of at least one object corresponding to the object text sequence, and outputting the data as search result data for the search content.
19. An electronic device, comprising:
a memory, and a processor; the memory is adapted to store a computer program which, when executed by the processor, performs the method of any of claims 1-12 and 15-18.
20. A storage device, characterized in that a computer program is stored which, when executed by a processor, performs the method of any one of claims 1-12 and 15-18.
CN202110025121.1A 2021-01-08 2021-01-08 Medicine dictionary generation and medicine search method and device Active CN112687403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110025121.1A CN112687403B (en) 2021-01-08 2021-01-08 Medicine dictionary generation and medicine search method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110025121.1A CN112687403B (en) 2021-01-08 2021-01-08 Medicine dictionary generation and medicine search method and device

Publications (2)

Publication Number Publication Date
CN112687403A CN112687403A (en) 2021-04-20
CN112687403B true CN112687403B (en) 2022-12-02

Family

ID=75456809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110025121.1A Active CN112687403B (en) 2021-01-08 2021-01-08 Medicine dictionary generation and medicine search method and device

Country Status (1)

Country Link
CN (1) CN112687403B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312523B (en) * 2021-07-30 2021-12-14 北京达佳互联信息技术有限公司 Dictionary generation and search keyword recommendation method and device and server
CN115831314A (en) * 2023-02-16 2023-03-21 江苏曼荼罗软件股份有限公司 Medicine getting application method and system based on medical advice decomposition

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169495A (en) * 2011-04-11 2011-08-31 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN105630836A (en) * 2014-11-05 2016-06-01 阿里巴巴集团控股有限公司 Searching result sorting method and apparatus
CN109284397A (en) * 2018-09-27 2019-01-29 深圳大学 A kind of construction method of domain lexicon, device, equipment and storage medium
CN110490712A (en) * 2019-08-21 2019-11-22 浙江中国轻纺城网络有限公司 A kind of commodity class heading search method, system and storage medium
CN111723570A (en) * 2020-06-09 2020-09-29 平安科技(深圳)有限公司 Medicine knowledge graph construction method and device and computer equipment
CN111985241A (en) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Medical information query method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN112687403A (en) 2021-04-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant