CN116795947A - Document recommendation method, device, electronic equipment and computer readable storage medium - Google Patents

Document recommendation method, device, electronic equipment and computer readable storage medium

Info

Publication number
CN116795947A
Authority
CN
China
Prior art keywords
document
keywords
candidate words
information
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210254300.7A
Other languages
Chinese (zh)
Inventor
刘冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Technology Group Co Ltd
Original Assignee
TCL Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TCL Technology Group Co Ltd
Priority to CN202210254300.7A
Publication of CN116795947A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a document recommending method, a device, electronic equipment and a computer readable storage medium, wherein the method comprises the following steps: respectively determining statistical information of the documents in the document dataset and candidate words corresponding to the documents; determining keywords of the document according to the statistical information and the candidate words; and matching the acquired historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user. The method and the device have high document recommendation efficiency, and simultaneously improve the accuracy of document recommendation.

Description

Document recommendation method, device, electronic equipment and computer readable storage medium
Technical Field
The application relates to the technical field of natural language processing, in particular to a document recommending method, a document recommending device, electronic equipment and a computer readable storage medium.
Background
In industry, technical documents written by research experts are highly domain-specific, so it is difficult to organize them manually into a complete knowledge system, which makes it inconvenient for later staff to study and consult them. Moreover, this document data is voluminous and disorganized, the documents have no organic links to one another, and most of the data is unstructured.
Therefore, how to integrate and filter these document data, and more precisely recommend useful document information in related fields to users is a technical problem to be solved.
Disclosure of Invention
The embodiment of the application provides a document recommending method, a document recommending device, electronic equipment and a computer readable storage medium, so as to improve document recommending efficiency and accuracy.
The embodiment of the application provides a document recommending method, which comprises the following steps:
respectively determining statistical information of the documents in the document dataset and candidate words corresponding to the documents;
determining keywords of the document according to the statistical information and the candidate words;
and matching the acquired historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user.
The embodiment of the application also provides a document recommending device, which comprises:
a first determining unit, configured to determine statistical information of a document in the document dataset and candidate words corresponding to the document, respectively;
a second determining unit for determining keywords of the document according to the statistical information and the candidate words;
and the recommending unit is used for matching the acquired historical browsing information of the target user with the keywords to obtain a document recommending result corresponding to the target user.
The embodiment of the application also provides electronic equipment, which comprises a processor, a memory and a document recommendation program stored in the memory and capable of running on the processor, wherein the processor executes the document recommendation program to realize the steps in the document recommendation method.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a document recommendation program, and the document recommendation program is executed by a processor to realize the steps in the document recommendation method.
According to the application, the statistical information of each document in the document data set is obtained first; the documents are then segmented into words to obtain the candidate words corresponding to each document; and the keywords of each document are then extracted from the candidate words based on the statistical information of that document.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a document recommendation system according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a document recommendation method according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a document recommendation apparatus according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides a document recommending method, a document recommending device, electronic equipment and a computer readable storage medium.
The document recommending device can be integrated in electronic equipment, and the electronic equipment can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer (Personal Computer, PC) or the like; the server may be a single server or a server cluster composed of a plurality of servers.
In some embodiments, the document recommendation apparatus may be integrated in a plurality of electronic devices, for example, the document recommendation apparatus may be integrated in a plurality of servers, and the document recommendation method of the present application is implemented by the plurality of servers.
In some embodiments, the server may also be implemented in the form of a terminal.
For example, taking the example that the document recommending device is integrated in an electronic device, the electronic device can respectively determine statistical information of the documents in the document dataset and candidate words corresponding to the documents; determining keywords of the document according to the statistical information and the candidate words; and matching the acquired historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user.
Referring to fig. 1, fig. 1 is a schematic view of a scenario of a document recommendation system provided in an embodiment of the present application, where the system may include a server 10 and a storage terminal 11, the storage terminal 11 may store a document data set, and the server 10 and the storage terminal 11 are in communication connection, which is not described herein.
Wherein the server 10 may include a processor, a memory, and the like; the storage terminal 11 may include a cloud server or the like.
It should be noted that the system scenario shown in fig. 1 is only an example. The servers and scenarios described in the embodiments of the present application are intended to describe the technical solutions of the embodiments more clearly and do not limit them. Those skilled in the art will appreciate that, as the system evolves and new service scenarios appear, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems. Detailed descriptions follow. The order in which the embodiments are described is not intended to limit their order of preference.
The embodiment of the disclosure aims at providing a document recommending method, which can respectively determine statistical information of documents in a document dataset and candidate words corresponding to the documents; determining keywords of the document according to the statistical information and the candidate words; and matching the acquired historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user.
As shown in fig. 2, the specific flow of the document recommendation method may be as follows:
210. Respectively determine statistical information of the documents in the document dataset and candidate words corresponding to the documents.
In an embodiment of the application, the document dataset may comprise a number of different documents. The statistical information may include statistical information based on word weights, statistical information based on the location of words in the document, and/or corresponding statistical information based on the association of words in the document, and so forth. The embodiment of the application can also perform word segmentation on the document so as to obtain candidate words corresponding to the document after word segmentation.
Wherein, the statistical information based on word weight can include part of speech, word frequency, inverse document frequency, relative word frequency or word length, etc.; the statistical information based on the location of words in a document may include the first N words, the last N words, the beginning of a paragraph, the end of a paragraph, a title or introduction, etc. of the document; the statistical information based on the association of words in the document may include mutual information, HITS values, contribution, dependency or TF-IDF values, etc.
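As an illustration of the word-weight statistics listed above, the following minimal sketch computes word frequency, inverse document frequency and their product (TF-IDF) over already-segmented documents. The function name `tf_idf`, the sample tokens and the unsmoothed logarithmic IDF are assumptions chosen for the example; the application does not prescribe a particular formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: tf-idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs at least once.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Relative word frequency times (unsmoothed) inverse document frequency.
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical, already-segmented documents.
docs = [["motor", "torque", "motor"], ["motor", "sensor"], ["sensor", "noise"]]
scores = tf_idf(docs)
print(scores[0])
```

In this toy data set "torque" outscores "motor" in the first document even though "motor" is more frequent there, because "motor" also appears in another document.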
In some embodiments, determining candidate words for a document in a document dataset includes:
and performing word segmentation processing on the documents in the document dataset to obtain candidate words corresponding to the documents.
In the embodiment of the application, word segmentation splits each sentence in the document into a plurality of words, and these words are the candidate words corresponding to the document. For example, if the sentence to be segmented is "window front bright moon light", the segmentation processing can divide it into "window front", "bright moon" and "light".
In some embodiments, word segmentation processing is performed on a document in a document dataset to obtain candidate words corresponding to the document, including:
determining the format of a document in a document dataset and a domain dictionary corresponding to the document respectively;
converting the format of the document into a preset format to obtain the document with the converted format;
updating the domain dictionary according to the document after format conversion to obtain an updated domain dictionary;
based on the updated domain dictionary, word segmentation is carried out on the document, and a plurality of candidate words corresponding to the document after the word segmentation are obtained.
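The effect of updating the domain dictionary before segmentation can be sketched as follows. The application does not name a segmentation algorithm; forward maximum matching is assumed here purely for illustration, and the function name `segment`, the `max_len` parameter and the sample strings are hypothetical.

```python
def segment(text, dictionary, max_len=9):
    """Forward maximum matching: at each position take the longest dictionary entry."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary entry matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

domain_dictionary = {"back", "light", "module"}
before = segment("backlightmodule", domain_dictionary)

# Updating the dictionary with a newly mined word changes the segmentation.
domain_dictionary.add("backlight")
after = segment("backlightmodule", domain_dictionary)
print(before, after)
```

Unspaced English is used only so the example is readable; the same loop works on any string of characters.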
In the embodiment of the application, the documents can belong to different technical fields, such as the biological field, the artificial intelligence field and the chemical field. A separate domain dictionary can be set for each technical field, and each domain dictionary stores the keywords of its field. In addition, the documents may be in different formats, such as pptx, ppt, xlsx or doc. The preset format may be a text format or the like. Mutual information reflects the internal information of a text segment: the frequency with which adjacent characters occur together in the segment is counted, and whether a character string of length f forms a word is judged by calculating the mutual information between its adjacent characters. Information entropy reflects the external information of a text segment, namely whether the segment has rich left and right collocations.
Before extracting keywords from a document, the embodiment of the application can uniformly convert the documents with different formats into a preset format, such as a text format, so as to obtain the document with the converted format. According to the embodiment of the application, new words can be mined from the document after format conversion by utilizing information entropy and mutual information, a domain dictionary is updated after the new words are extracted, word segmentation processing is performed on the document based on the domain dictionary, and then keywords of the document are extracted from a plurality of candidate words corresponding to the document based on statistical information of the document.
In some embodiments, updating the domain dictionary from the format-converted document to obtain an updated domain dictionary, comprising:
extracting new words of the document after format conversion based on mutual information and information entropy;
and updating the domain dictionary according to the new words to obtain an updated domain dictionary.
In the embodiment of the application, extracting new words from the document after format conversion based on mutual information and information entropy can comprise: randomly extracting a fragment from the document after format conversion; determining the information entropy of the fragment; when the information entropy of the segment is larger than an information entropy threshold value, obtaining a first candidate feature word; determining mutual information of the first candidate feature words; when the mutual information of the first candidate feature words is larger than a mutual information threshold value, obtaining second candidate feature words; determining the solidification degree of the second candidate feature word; and when the solidification degree of the second candidate feature words is larger than the solidification degree threshold value, taking the second candidate feature words as new words.
The embodiment of the application can randomly extract a fragment, such as a text fragment, from the document after format conversion. The information entropy of the segment is determined first, and when it is larger than the information entropy threshold, the segment can be used as a first candidate feature word. The mutual information of the first candidate feature word is then determined; mutual information larger than the mutual information threshold indicates a high correlation between the characters in the segment, so the segment can further be used as a second candidate feature word. The degree of solidification is how tightly the characters within a segment bind together. For example, the degree of solidification between the characters of words such as "glass" and "durian" is very high.
In addition, the second candidate feature words may include ambiguous combinations. For example, the two words "size" and "small window" each have a high internal degree of solidification, which may cause the spurious word "size window" to appear. To better reject such ambiguous words, the embodiment of the application may also set a relatively high solidification degree threshold and delete a second candidate feature word if its degree of solidification is less than that threshold. With a larger mutual information threshold, the application can segment words better and exclude ambiguous words.
In some embodiments, the information entropy of the segment includes a left information entropy and a right information entropy, and obtaining the first candidate feature word when the information entropy of the segment is greater than the information entropy threshold includes:
when the left information entropy is larger than the left information entropy threshold and the right information entropy is larger than the right information entropy threshold, taking the segment as a first candidate feature word.
In the embodiment of the application, the left information entropy of the segment w is defined as $H_L(w) = -\sum_{x} p(x \mid w) \log p(x \mid w)$, and the right information entropy of the segment w is defined as $H_R(w) = -\sum_{y} p(y \mid w) \log p(y \mid w)$, where x and y are both character strings, x is a left string, y is a right string, p(x|w) represents the probability that string x is a left adjacent string of segment w, and p(y|w) represents the probability that string y is a right adjacent string of segment w.
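The left and right information entropy of a segment can be estimated from raw text roughly as in the sketch below. It assumes the usual Shannon form (the negative sum of p log p over the neighbours of the segment) and, as a simplification, treats each adjacent string as a single adjacent character. The function name `neighbor_entropy` and the sample text are hypothetical.

```python
import math
from collections import Counter

def neighbor_entropy(text, segment):
    """Return (left_entropy, right_entropy) of `segment` over its occurrences in `text`."""
    left, right = Counter(), Counter()
    start = text.find(segment)
    while start != -1:
        end = start + len(segment)
        if start > 0:
            left[text[start - 1]] += 1   # character immediately to the left
        if end < len(text):
            right[text[end]] += 1        # character immediately to the right
        start = text.find(segment, start + 1)

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total) for c in counts.values())

    return entropy(left), entropy(right)

left_h, right_h = neighbor_entropy("1ab2 3ab2 1ab4", "ab")
print(left_h, right_h)
```

A segment would pass the first filter only when both returned values exceed their respective thresholds; a segment that always has the same neighbour on one side gets an entropy of zero on that side.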
In some embodiments, determining mutual information for the first candidate feature word includes:
determining the probability of occurrence of the first candidate feature word in the document;
determining the probability of the independent occurrence of the left character string of the first candidate feature word and the probability of the independent occurrence of the right character string of the first candidate feature word;
obtaining the product of the probability of the single occurrence of the left character string of the first candidate feature word and the probability of the single occurrence of the right character string of the first candidate feature word, and obtaining a probability product value;
calculating the ratio between the probability of the first candidate feature word in the document and the probability product value to obtain a probability ratio;
and taking the logarithm of the probability ratio to obtain the mutual information of the first candidate feature words.
In the embodiment of the present application, the mutual information of the first candidate feature word is based on the probability of the segment divided by the product of the probabilities of its subsequences. A segment may be split into subsequences in several ways, and the mutual information over these splits may be accumulated as the final degree of aggregation of the segment.
Wherein, the mutual information is $\mathrm{MI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$, where P(x, y) is the probability that the first candidate feature word appears in the document, P(x) is the probability that the left character string of the first candidate feature word appears alone, and P(y) is the probability that the right character string of the first candidate feature word appears alone.
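The steps above (joint probability, product of the two individual probabilities, ratio, logarithm) can be sketched as follows. Probabilities are estimated here as occurrence counts divided by the number of possible start positions, which is an assumption made for the example; the helper names `count`, `probability`, `mutual_information` and `cohesion` are hypothetical, and `cohesion` is only one possible reading of accumulating mutual information over the splits of a segment.

```python
import math

def count(text, piece):
    """Count possibly overlapping occurrences of `piece` in `text`."""
    return sum(text.startswith(piece, i) for i in range(len(text) - len(piece) + 1))

def probability(text, piece):
    positions = len(text) - len(piece) + 1
    return count(text, piece) / positions if positions > 0 else 0.0

def mutual_information(text, left, right):
    """log( P(left+right) / (P(left) * P(right)) ), or -inf if the pair never occurs."""
    p_xy = probability(text, left + right)
    if p_xy == 0:
        return float("-inf")
    return math.log(p_xy / (probability(text, left) * probability(text, right)))

def cohesion(text, segment):
    """Accumulate mutual information over every left/right split of the segment."""
    return sum(mutual_information(text, segment[:i], segment[i:])
               for i in range(1, len(segment)))

text = "abcabcxyz"
print(mutual_information(text, "a", "b"), cohesion(text, "abc"))
```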
220. Keywords of the document are determined according to the statistical information and the candidate words.
In some embodiments, determining keywords for a document based on statistical information and candidate words includes:
processing the candidate words according to the statistical information to obtain feature quantization values corresponding to the candidate words;
and selecting keywords of the document from the candidate words according to the characteristic quantization values.
In the embodiment of the application, the candidate words are subjected to characteristic quantization processing according to the statistical information, which can comprise characteristic quantization based on the weight of the candidate words in the document, characteristic quantization based on the position of the candidate words in the document, characteristic quantization based on the association information of the candidate words in the document and the like. According to the embodiment of the application, the candidate words are subjected to characteristic quantization processing based on the statistical information of the document, the characteristic quantization value corresponding to the candidate words can be obtained, and then the keywords of the document can be selected from the candidate words according to the characteristic quantization value.
In some embodiments, processing the candidate word according to the statistical information to obtain a feature quantization value corresponding to the candidate word includes:
respectively determining the weight, the position and the associated information of the candidate words in the document according to the statistical information;
performing feature quantization based on weights of the candidate words in the document to obtain weight quantization values corresponding to the candidate words;
performing feature quantization based on the position of the candidate word in the document to obtain a position quantization value corresponding to the candidate word;
performing feature quantization based on the association information of the candidate words in the document to obtain association information quantization values corresponding to the candidate words;
and weighting the weight quantized value, the position quantized value and the associated information quantized value to obtain the characteristic quantized value corresponding to the candidate word.
In the embodiment of the application, the weight of the candidate word in the document can include part of speech, word frequency, inverse document frequency, relative word frequency or word length, etc.; the position of the word in the document may include the first N words, the last N words, the beginning of a paragraph, the end of a paragraph, the title or introduction, etc. of the document; the association information of words in a document may include mutual information, HITS values, contribution, dependency or TF-IDF values, etc.
The application can first obtain different quantized values, which may include a weight quantized value, a position quantized value and an associated information quantized value, then perform weighted summation over these quantized values, and finally use the weighted sum as the feature quantized value of the candidate word. The feature quantized value corresponding to the candidate word is $t = \eta_1 t_1 + \eta_2 t_2 + \eta_3 t_3$, where $t_1$ is the weight quantized value, $t_2$ is the position quantized value, $t_3$ is the associated information quantized value, $\eta_1$ is the weight of the weight quantized value, $\eta_2$ is the weight of the position quantized value, and $\eta_3$ is the weight of the associated information quantized value.
In some embodiments, selecting keywords of a document from candidate words according to feature quantization values includes:
sorting the feature quantized values corresponding to the candidate words according to the sequence from high to low to obtain a feature quantized sorting result;
determining candidate words corresponding to the first k feature quantized values in the feature quantized sorting result, wherein 1 ≤ k ≤ n, and n is the number of feature quantized values corresponding to the candidate words;
and taking candidate words corresponding to the first k feature quantized values in the feature quantized sequencing result as keywords of the document.
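A minimal sketch of the scoring and selection steps: each candidate word carries a weight, position and association quantized value, these are combined by weighted summation, and the k highest-scoring candidates are kept as keywords. The weights in `eta`, the quantized values and the function names are hypothetical and would need to be tuned or defined in a real system.

```python
def feature_value(t1, t2, t3, eta=(0.5, 0.2, 0.3)):
    """Weighted sum of the weight, position and association quantized values."""
    return eta[0] * t1 + eta[1] * t2 + eta[2] * t3

def top_k_keywords(candidates, k, eta=(0.5, 0.2, 0.3)):
    """candidates: {word: (t1, t2, t3)}. Returns the k words with the highest score."""
    scored = {word: feature_value(*values, eta=eta) for word, values in candidates.items()}
    # Sort from high to low and keep the first k candidate words.
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical quantized values for three candidate words.
candidates = {
    "torque": (0.9, 0.5, 0.7),
    "the": (0.1, 0.9, 0.0),
    "sensor": (0.6, 0.2, 0.8),
}
keywords = top_k_keywords(candidates, k=2)
print(keywords)
```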
230. Match the acquired historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user.
In the embodiment of the application, the historical browsing information of the target user can comprise a search request of the target user, document information uploaded by the target user, browsing record information of the target user and the like; wherein the search request of the target user may include a search keyword or the like.
According to the embodiment of the application, the keywords corresponding to the historical browsing information of the user can be determined according to the historical browsing information of the user, the document data set comprises different documents, each document is provided with a plurality of corresponding keywords, the keywords corresponding to the historical browsing information of the user are matched with the keywords of the different documents in the document data set, and the document recommendation result corresponding to the user can be obtained.
In some embodiments, matching the obtained historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user includes:
acquiring historical browsing information of a target user in a preset period;
word segmentation processing is carried out on the historical browsing information, and keywords corresponding to the historical browsing information are obtained;
determining the embedding of the keywords corresponding to the historical browsing information according to the keywords corresponding to the historical browsing information;
determining keyword embedding corresponding to keywords of each document in the document data set;
calculating first similarity between keyword embedding corresponding to the historical browsing information and keyword embedding corresponding to keywords of each document in the document dataset;
sequencing the first similarity according to the sequence from high to low to obtain a first similarity sequencing result;
determining the documents corresponding to the first m similarities in the first similarity sorting result, wherein 1 ≤ m ≤ h, and h is the number of first similarities computed for the keyword embeddings of the documents;
and taking the documents corresponding to the first m similarity in the first similarity sorting result as document recommendation results corresponding to the target user.
In the embodiment of the application, each word corresponds to a unique vector in the vector space. Determining the keyword embedding corresponding to the historical browsing information means converting the keywords corresponding to the historical browsing information into word vectors (namely, the keyword embedding corresponding to the historical browsing information) through a word embedding method (such as word2vec); similarly, the keywords of each document in the document data set can also be converted into word vectors (namely, the keyword embedding corresponding to the keywords of each document). The first similarity is the similarity between the keyword embedding corresponding to the historical browsing information and the keyword embedding corresponding to the keywords of each document in the document data set, and the first similarity can be cosine similarity or another similarity measure.
According to the embodiment of the application, for a target user with search or browsing records, the most recent behavior records of the target user (such as the 10 most recent records) can be selected as the historical browsing information of the target user. The keywords of these behavior records are obtained, and the embedding of those keywords (namely, the keyword embedding corresponding to the historical browsing information) is calculated. The similarity between this embedding and the keyword embedding corresponding to the keywords of each document is then calculated, the results are sorted, and the documents with the highest few similarities are taken as the knowledge document recommendation for the user. In addition, for a target user without search or browsing records who has uploaded documents, the keywords of the uploaded documents can be obtained and the corresponding keyword embedding determined; the similarity between that embedding and the keyword embedding corresponding to the keywords of each document is calculated, the results are sorted, and the documents with the highest few similarities are taken as the knowledge document recommendation for the user.
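The matching step can be sketched with plain Python lists standing in for word vectors. The tiny two-dimensional `embeddings` table is hypothetical (a real system would load vectors from a trained model such as word2vec), and averaging the keyword vectors of a document into a single vector is an assumption made for this example, since the text does not say how several keyword embeddings are combined.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(history_keywords, doc_keywords, embeddings, m):
    """Return the m documents whose keyword embedding is most similar to the user's."""
    user_vec = mean_vector([embeddings[w] for w in history_keywords])
    sims = {
        doc: cosine(user_vec, mean_vector([embeddings[w] for w in words]))
        for doc, words in doc_keywords.items()
    }
    # Sort similarities from high to low and keep the first m documents.
    return sorted(sims, key=sims.get, reverse=True)[:m]

# Hypothetical embeddings and per-document keywords.
embeddings = {"motor": [1.0, 0.0], "torque": [0.9, 0.1],
              "sensor": [0.0, 1.0], "noise": [0.1, 0.9]}
doc_keywords = {"doc_a": ["motor", "torque"],
                "doc_b": ["sensor", "noise"],
                "doc_c": ["motor", "sensor"]}
result = recommend(["torque"], doc_keywords, embeddings, m=2)
print(result)
```

The same function covers the uploaded-document case by passing the keywords of the uploaded documents as `history_keywords`.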
In some embodiments, the matching the obtained historical browsing information of the target user with the keywords to obtain the document recommendation result corresponding to the target user further includes:
when no historical browsing information of the target user is detected, counting the number of times each document in the document data set was browsed within a statistical time period;
based on the browse count corresponding to each document in the document data set, acquiring the first z documents with the highest browse counts, wherein z is more than or equal to 1 and less than or equal to w, and w is the total number of the documents in the document data set in the statistical time period;
and taking the first z documents with the highest browse counts as document recommendation results corresponding to the target user.
In the embodiment of the application, for a user who has neither search or browsing records nor uploaded documents, the documents most frequently browsed by other users within the statistical time period (for example, each day) can be obtained as the document recommendation results corresponding to the user.
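The popularity fallback for users with no history can be sketched as below. The record layout (pairs of document id and view time) and the function name `most_browsed` are assumptions made for illustration; the application only states that browse counts are collected over a statistical time period.

```python
from collections import Counter
from datetime import datetime, timedelta

def most_browsed(view_log, now, period, z):
    """Return up to z document ids with the highest view counts inside the period.

    view_log is an iterable of (doc_id, viewed_at) pairs (assumed layout).
    """
    cutoff = now - period
    counts = Counter(
        doc_id for doc_id, viewed_at in view_log if cutoff <= viewed_at <= now
    )
    # most_common returns entries ordered from the highest count downwards.
    return [doc_id for doc_id, _ in counts.most_common(z)]

now = datetime(2022, 3, 15, 12, 0)
log = [
    ("doc_a", now - timedelta(hours=1)),
    ("doc_b", now - timedelta(hours=2)),
    ("doc_a", now - timedelta(hours=3)),
    ("doc_c", now - timedelta(days=2)),  # outside a one-day period
]
popular = most_browsed(log, now, timedelta(days=1), z=1)
```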
In some embodiments, after matching the obtained historical browsing information of the target user with the keywords to obtain the document recommendation result corresponding to the target user, the method further includes:
acquiring a search request of a target user;
word segmentation processing is carried out on the search request to obtain search keywords;
determining the embedding of the search keywords corresponding to the search keywords;
determining keyword embedding corresponding to keywords of each document in the document data set;
determining a second similarity between the search keyword embedding and the keyword embedding corresponding to the keywords of each document in the document dataset;
sequencing the second similarity to obtain a second similarity sequencing result;
determining the maximum value in the second similarity sorting result, and acquiring a document corresponding to the maximum value;
and taking the document corresponding to the maximum value as a target document corresponding to the search request.
In the embodiment of the present application, the second similarity is a similarity between the embedding of the search keyword and the embedding of the keyword corresponding to the keyword of each document in the document data set, and the second similarity may be a cosine similarity or the like.
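The search steps listed above can be sketched as follows. Several points are assumptions rather than details from the application: cleaning and segmentation are approximated by lowercasing and splitting on non-word characters (Chinese text would need a real segmenter), keyword vectors are mean-pooled into one vector, and the names `embed` and `search` are hypothetical.

```python
import math
import re

def embed(words, word_vectors):
    """Mean of the vectors of the known words, or None if no word is known."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def search(query, doc_keywords, word_vectors):
    """Return the id of the document with the maximum second similarity, or None."""
    # Stand-in for cleaning and word segmentation of the search request.
    terms = [t for t in re.split(r"\W+", query.lower()) if t]
    query_vec = embed(terms, word_vectors)
    if query_vec is None:
        return None
    best_id, best_score = None, float("-inf")
    for doc_id, keywords in doc_keywords.items():
        doc_vec = embed(keywords, word_vectors)
        if doc_vec is None:
            continue
        score = cosine(query_vec, doc_vec)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id

# Hypothetical toy data standing in for trained embeddings and extracted keywords.
vectors = {"printer": [1.0, 0.0], "driver": [0.8, 0.2], "payroll": [0.0, 1.0]}
docs = {"d1": ["printer", "driver"], "d2": ["payroll"]}
hit = search("Printer driver!", docs, vectors)
```

Because only the maximum is needed, the sketch tracks the best score in a single pass instead of sorting all second similarities; the result is the same document.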
The embodiment of the application can extract keywords from all documents, store basic information of the documents (including items, titles, first-sentence contents of the documents and the like), and calculate the keyword embedding corresponding to the keywords of each document by using a word embedding method. The embodiment of the application can also search and score quickly in response to a user search request: after the user's search content is cleaned and segmented, the similarity (namely, the second similarity) between the embedding of the search keywords and the keyword embedding corresponding to the keywords of each document is calculated, the document corresponding to the maximum value in the second similarity sorting result is obtained, and that document is taken as the target document corresponding to the search request. According to the document recommendation method provided by the embodiment of the application, the statistical information of the documents in the document dataset and the candidate words corresponding to the documents can be respectively determined; then keywords of the document are determined according to the statistical information and the candidate words; and the acquired historical browsing information of the target user is matched with the keywords to obtain a document recommendation result corresponding to the target user.
According to the application, the statistical information of each document in the document data set is firstly obtained, then the documents are subjected to word segmentation processing, so that candidate words corresponding to the documents are obtained, then the candidate words are subjected to characteristic quantization processing based on the statistical information of the documents, so that characteristic quantization values corresponding to the candidate words are obtained, and then the keywords of the documents are extracted from the candidate words according to the characteristic quantization values.
In order to better implement the method, the embodiment of the application also provides a document recommending device which can be integrated in electronic equipment, wherein the electronic equipment can be a terminal, a server and the like. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.
For example, in the present embodiment, a method according to an embodiment of the present application will be described in detail by taking a document recommendation apparatus specifically integrated in a document recommendation server as an example.
For example, as shown in fig. 3, the document recommending apparatus may include a first determining unit 310, a second determining unit 320, and a recommending unit 330, as follows:
(I) First determining unit 310
The first determining unit 310 is configured to determine statistical information of the documents in the document dataset and candidate words corresponding to the documents, respectively.
In some embodiments, the first determining unit 310 includes a word segmentation unit configured to:
performing word segmentation processing on the documents in the document dataset to obtain candidate words corresponding to the documents.
In some embodiments, the word segmentation unit includes a word segmentation subunit for:
determining the format of a document in a document dataset and a domain dictionary corresponding to the document respectively;
converting the format of the document into a preset format to obtain the document with the converted format;
updating the domain dictionary according to the document after format conversion to obtain an updated domain dictionary;
based on the updated domain dictionary, word segmentation is carried out on the document, and a plurality of candidate words corresponding to the document after the word segmentation are obtained.
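The effect of updating the domain dictionary before segmentation can be illustrated with a dictionary-based segmenter. The application does not name a segmentation algorithm; forward maximum matching is used here purely as an assumed stand-in, and `forward_max_match` and the sample dictionaries are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text greedily, preferring the longest dictionary entry at each position."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word matches;
        # a single character is always accepted so the loop is guaranteed to advance.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

base_dictionary = {"文档", "推荐"}
updated_dictionary = base_dictionary | {"关键词"}  # new word added by the update step
before = forward_max_match("文档关键词推荐", base_dictionary)
after = forward_max_match("文档关键词推荐", updated_dictionary)
```

With the original dictionary the domain term is broken into single characters, while the updated dictionary keeps it as one candidate word.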
In some embodiments, the word segmentation subunit includes a dictionary updating unit which, in order to update the domain dictionary according to the document after format conversion and obtain an updated domain dictionary, is used for:
extracting new words of the document after format conversion based on mutual information and information entropy;
and updating the domain dictionary according to the new words to obtain an updated domain dictionary.
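New-word extraction based on mutual information and information entropy is commonly done by combining pointwise mutual information (how strongly the characters of a string stick together) with left and right branching entropy (how freely the string combines with its neighbours). The sketch below follows that common approach; the restriction to two-character strings, the thresholds, and the name `discover_new_words` are assumptions, since the application gives no formulas.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if not total:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())

def discover_new_words(text, known_words, min_count=2, min_pmi=1.0, min_entropy=0.5):
    """Return two-character strings that look like words but are not in known_words."""
    n = len(text)
    char_freq = Counter(text)
    pair_freq = Counter(text[i:i + 2] for i in range(n - 1))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(n - 1):
        pair = text[i:i + 2]
        if i > 0:
            left[pair][text[i - 1]] += 1
        if i + 2 < n:
            right[pair][text[i + 2]] += 1
    found = []
    for pair, count in pair_freq.items():
        if count < min_count or pair in known_words:
            continue
        # Pointwise mutual information: how much more often the two characters
        # co-occur than they would if they were independent.
        p_pair = count / (n - 1)
        p_a, p_b = char_freq[pair[0]] / n, char_freq[pair[1]] / n
        pmi = math.log(p_pair / (p_a * p_b))
        # Branching entropy: a real word appears with varied neighbours on both sides.
        boundary = min(entropy(left[pair]), entropy(right[pair]))
        if pmi >= min_pmi and boundary >= min_entropy:
            found.append(pair)
    return found
```

The strings returned would then be added to the domain dictionary before segmentation; the threshold values would need tuning on real documents.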
(II) Second determining unit 320
The second determining unit 320 is configured to determine keywords of the document according to the statistical information and the candidate words.
In some embodiments, the second determining unit 320 includes a feature quantization unit for:
processing the candidate words according to the statistical information to obtain feature quantization values corresponding to the candidate words;
and selecting keywords of the document from the candidate words according to the characteristic quantization values.
In some embodiments, the feature quantization unit comprises a feature quantization subunit for:
respectively determining the weight, the position and the associated information of the candidate words in the document according to the statistical information;
performing feature quantization based on weights of the candidate words in the document to obtain weight quantization values corresponding to the candidate words;
performing feature quantization based on the position of the candidate word in the document to obtain a position quantization value corresponding to the candidate word;
performing feature quantization based on the association information of the candidate words in the document to obtain association information quantization values corresponding to the candidate words;
and weighting the weight quantized value, the position quantized value and the associated information quantized value to obtain the characteristic quantized value corresponding to the candidate word.
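The final weighting step can be sketched as a weighted sum of the three quantized values. The coefficients below and the sample values are hypothetical; the application does not state how the three values are weighted.

```python
def feature_quantization_value(weight_q, position_q, association_q,
                               coefficients=(0.5, 0.3, 0.2)):
    """Combine the three quantized values into one score by a weighted sum.

    The default coefficients are illustrative assumptions, not values from the text.
    """
    a, b, c = coefficients
    return a * weight_q + b * position_q + c * association_q

# Hypothetical quantized values for two candidate words of one document.
candidates = {
    "embedding": {"weight_q": 0.8, "position_q": 1.0, "association_q": 0.5},
    "method": {"weight_q": 0.9, "position_q": 0.2, "association_q": 0.1},
}
scores = {word: feature_quantization_value(**q) for word, q in candidates.items()}
```

In this toy case the word with the slightly lower weight value still scores higher overall because its position and association values are stronger, which is the point of combining the three features.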
In some embodiments, the second determining unit 320 further includes a keyword determining unit for:
sorting the feature quantized values corresponding to the candidate words according to the sequence from high to low to obtain a feature quantized sorting result;
determining candidate words corresponding to the first k feature quantized values in the feature quantized sequencing result, wherein k is more than or equal to 1 and less than or equal to n, and n is the number of feature quantized values corresponding to the candidate words;
and taking candidate words corresponding to the first k feature quantized values in the feature quantized sequencing result as keywords of the document.
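Selecting the first k candidate words can be sketched as below. The function name `select_keywords` is hypothetical, and raising an error when k falls outside the stated range of 1 to n is one possible design choice rather than behaviour described in the application.

```python
def select_keywords(scores, k):
    """Return the k candidate words with the highest feature quantization values."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError(f"k must satisfy 1 <= k <= {n}, got {k}")
    # sorted() is stable, so candidates with equal scores keep their input order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]

keywords = select_keywords({"a": 0.2, "b": 0.9, "c": 0.5}, k=2)
```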
(III) Recommending unit 330
The recommending unit 330 is configured to match the acquired historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user.
In some embodiments, the recommendation unit 330 includes a first recommendation unit to:
acquiring historical browsing information of a target user in a preset period;
word segmentation processing is carried out on the historical browsing information, and keywords corresponding to the historical browsing information are obtained;
determining the embedding of the keywords corresponding to the historical browsing information according to the keywords corresponding to the historical browsing information;
determining keyword embedding corresponding to keywords of each document in the document data set;
calculating first similarity between keyword embedding corresponding to the historical browsing information and keyword embedding corresponding to keywords of each document in the document dataset;
sequencing the first similarity according to the sequence from high to low to obtain a first similarity sequencing result;
determining the documents corresponding to the first m similarities in the first similarity sorting result, wherein m is more than or equal to 1 and less than or equal to h, and h is the number of first similarities calculated against the keyword embedding corresponding to the keywords of each document;
and taking the documents corresponding to the first m similarity in the first similarity sorting result as document recommendation results corresponding to the target user.
In some embodiments, the recommendation unit 330 further comprises a second recommendation unit for:
acquiring a search request of a target user;
word segmentation processing is carried out on the search request to obtain search keywords;
determining the embedding of the search keywords corresponding to the search keywords;
determining keyword embedding corresponding to keywords of each document in the document data set;
determining a second similarity between the search keyword embedding and the keyword embedding corresponding to the keywords of each document in the document dataset;
sequencing the second similarity to obtain a second similarity sequencing result;
determining the maximum value in the second similarity sorting result, and acquiring a document corresponding to the maximum value;
and taking the document corresponding to the maximum value as a target document corresponding to the search request.
In the implementation, each module may be implemented as an independent entity, or may be combined arbitrarily, and implemented as the same entity or several entities, and the implementation of each module may be referred to the foregoing method embodiment, which is not described herein again. As can be seen from the above, the document recommendation device of the present embodiment may determine statistical information of documents in the document dataset and candidate words corresponding to the documents, respectively; determining keywords of the document according to the statistical information and the candidate words; and matching the acquired historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user.
According to the application, the statistical information of each document in the document data set is firstly obtained, then the documents are subjected to word segmentation processing, so that candidate words corresponding to the documents are obtained, then the candidate words are subjected to characteristic quantization processing based on the statistical information of the documents, so that characteristic quantization values corresponding to the candidate words are obtained, and then the keywords of the documents are extracted from the candidate words according to the characteristic quantization values.
Correspondingly, the embodiment of the application also provides electronic equipment which can be a terminal or a server, wherein the terminal can be terminal equipment such as a smart phone, a tablet personal computer, a notebook computer, a touch screen, a game machine, a personal computer, a personal digital assistant (Personal Digital Assistant, PDA) and the like. The server may be a single server or a server cluster composed of a plurality of servers.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 4, the electronic device includes: memory 401, processor 402, and communication module 403.
The memory 401 may be, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read Only Memory, ROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), a magnetic disk, a solid state disk, or the like. The memory 401 is used for storing a program, and the processor 402 executes the program after receiving an execution instruction.
The processor 402 may be an integrated circuit chip with data processing capabilities. The processor 402 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The communication module 403 is used for communication connection between the electronic device and the external device, and implements the operations of receiving and transmitting network signals and data. The network signals may include wireless signals or wired signals.
The specific implementation of each module can be referred to the previous embodiments, and will not be repeated here.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
From the above, it can be seen that the electronic device provided in this embodiment can fully obtain the keywords of each document through the statistical information of the document, and then match the historical browsing information of the user with the keywords of the document to obtain the document recommendation result corresponding to the user.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the embodiment of the present application further provides a computer readable storage medium, where a document recommendation program is stored, where the document recommendation program is executed by a processor to implement the steps in any one of the document recommendation methods of the embodiments of the present application.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The storage medium may include: a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, and the like.
Because the computer program stored in the storage medium can execute the steps of any document recommendation method provided by the embodiments of the present application, it can achieve the beneficial effects of any such method; for details, refer to the previous embodiments, which are not repeated here.
The document recommendation method, device, electronic equipment and storage medium provided by the embodiments of the application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is only intended to help in understanding the method and core idea of the application. Meanwhile, those skilled in the art may, following the ideas of the present application, make changes to the specific embodiments and the scope of application; in summary, the content of this description should not be construed as limiting the present application.

Claims (12)

1. A document recommendation method, comprising:
respectively determining statistical information of a document in a document dataset and candidate words corresponding to the document;
determining keywords of the document according to the statistical information and the candidate words;
and matching the acquired historical browsing information of the target user with the keywords to obtain a document recommendation result corresponding to the target user.
2. The method of claim 1, wherein the determining candidate words corresponding to the document in the document dataset comprises:
and performing word segmentation processing on the document in the document dataset to obtain candidate words corresponding to the document.
3. The method of claim 2, wherein the word segmentation process is performed on the document in the document dataset to obtain candidate words corresponding to the document, and the method comprises:
determining the format of the document in the document dataset and the domain dictionary corresponding to the document respectively;
converting the format of the document into a preset format to obtain a document with the converted format;
updating the domain dictionary according to the document converted by the format to obtain an updated domain dictionary;
and carrying out word segmentation on the document based on the updated domain dictionary to obtain a plurality of candidate words corresponding to the document after word segmentation.
4. The method of claim 3, wherein updating the domain dictionary from the format-converted document to obtain an updated domain dictionary comprises:
extracting new words of the document after format conversion based on mutual information and information entropy;
and updating the domain dictionary according to the new words to obtain an updated domain dictionary.
5. The method of any of claims 1-4, wherein the determining keywords of the document from the statistical information and the candidate words comprises:
processing the candidate words according to the statistical information to obtain feature quantization values corresponding to the candidate words;
and selecting keywords of the document from the candidate words according to the characteristic quantization value.
6. The method of claim 5, wherein the processing the candidate word according to the statistical information to obtain the feature quantization value corresponding to the candidate word comprises:
respectively determining the weight, the position and the associated information of the candidate words in the document according to the statistical information;
performing feature quantization based on the weight of the candidate word in the document to obtain a weight quantization value corresponding to the candidate word;
performing feature quantization based on the position of the candidate word in the document to obtain a position quantization value corresponding to the candidate word;
performing feature quantization based on the association information of the candidate words in the document to obtain association information quantization values corresponding to the candidate words;
and carrying out weighting processing on the weight quantized value, the position quantized value and the associated information quantized value to obtain a characteristic quantized value corresponding to the candidate word.
7. The method of claim 6, wherein the selecting the keywords of the document from the candidate words according to the feature quantization value comprises:
sorting the feature quantized values corresponding to the candidate words according to the sequence from high to low to obtain a feature quantized sorting result;
determining candidate words corresponding to the first k feature quantized values in the feature quantized sequencing result, wherein k is more than or equal to 1 and less than or equal to n, and n is the number of feature quantized values corresponding to the candidate words;
and taking candidate words corresponding to the first k characteristic quantization values in the characteristic quantization sequencing result as keywords of the document.
8. The method of claim 7, wherein the matching the obtained historical browsing information of the target user with the keyword to obtain the document recommendation result corresponding to the target user comprises:
acquiring historical browsing information of a target user in a preset period;
word segmentation processing is carried out on the historical browsing information to obtain keywords corresponding to the historical browsing information;
determining the embedding of the keywords corresponding to the historical browsing information according to the keywords corresponding to the historical browsing information;
determining keyword embedding corresponding to keywords of each document in the document dataset;
calculating first similarity between keyword embedding corresponding to the historical browsing information and keyword embedding corresponding to keywords of each document in the document dataset;
sequencing the first similarity according to the sequence from high to low to obtain a first similarity sequencing result;
determining the documents corresponding to the first m similarities in the first similarity sorting result, wherein m is more than or equal to 1 and less than or equal to h, and h is the number of first similarities calculated against the keyword embedding corresponding to the keywords of each document;
and taking the documents corresponding to the first m similarity in the first similarity sorting result as document recommendation results corresponding to the target user.
9. The method of claim 7, wherein the matching the obtained historical browsing information of the target user with the keyword to obtain the document recommendation result corresponding to the target user further comprises:
acquiring a search request of the target user;
word segmentation processing is carried out on the search request to obtain search keywords;
determining the embedding of the search keywords corresponding to the search keywords;
determining keyword embedding corresponding to keywords of each document in the document dataset;
determining a second similarity between the search keyword embedding and the keyword embedding corresponding to keywords of each document in the document dataset;
sequencing the second similarity to obtain a second similarity sequencing result;
determining the maximum value in the second similarity sorting result, and acquiring a document corresponding to the maximum value;
and taking the document corresponding to the maximum value as a target document corresponding to the search request.
10. A document recommendation apparatus, comprising:
a first determining unit, configured to determine statistical information of a document in a document dataset and candidate words corresponding to the document, respectively;
a second determining unit configured to determine keywords of the document according to the statistical information and the candidate words;
and the recommending unit is used for matching the acquired historical browsing information of the target user with the keywords to obtain a document recommending result corresponding to the target user.
11. An electronic device comprising a processor, a memory and a document recommendation program stored in the memory and executable on the processor, wherein the processor, when executing the document recommendation program, implements the steps in the document recommendation method according to any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a document recommendation program, which when executed by a processor, implements the steps in the document recommendation method according to any one of claims 1 to 9.
CN202210254300.7A 2022-03-15 2022-03-15 Document recommendation method, device, electronic equipment and computer readable storage medium Pending CN116795947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210254300.7A CN116795947A (en) 2022-03-15 2022-03-15 Document recommendation method, device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210254300.7A CN116795947A (en) 2022-03-15 2022-03-15 Document recommendation method, device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116795947A true CN116795947A (en) 2023-09-22

Family

ID=88035064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210254300.7A Pending CN116795947A (en) 2022-03-15 2022-03-15 Document recommendation method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116795947A (en)

Legal Events

Date Code Title Description
PB01 Publication