CN113590861B - Picture information processing method and device and electronic equipment - Google Patents


Info

Publication number
CN113590861B
Authority
CN
China
Prior art keywords
picture
cluster
keyword
keywords
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010366994.4A
Other languages
Chinese (zh)
Other versions
CN113590861A (en
Inventor
潘达
董国盛
周泽南
苏雪峰
陈炜鹏
许静芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010366994.4A priority Critical patent/CN113590861B/en
Publication of CN113590861A publication Critical patent/CN113590861A/en
Application granted granted Critical
Publication of CN113590861B publication Critical patent/CN113590861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/535Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for processing picture information, and an electronic device. The method comprises: performing duplicate-picture clustering on pictures in web pages to obtain a cluster picture for each cluster and the set of marked text fields of that cluster picture; for each cluster picture, obtaining, from the set of marked text fields, the keywords contained in each marked text field and their word weights, where a word weight reflects the relevance of the keyword to the cluster picture; obtaining target keywords of each cluster picture according to the word weights of all keywords corresponding to that cluster picture; and ranking picture search results according to the target keywords of each cluster picture and their word weights. In this technical solution, a large number of marked text fields are obtained through duplicate-picture clustering, the target keywords of each picture and their word weights are selected from those fields, and the picture search results are ranked accordingly. This solves the technical problem in the prior art that pages whose pictures and text are inconsistent reduce the accuracy of picture search ranking.

Description

Picture information processing method and device and electronic equipment
Technical Field
The present invention relates to the field of software technologies, and in particular, to a method and an apparatus for processing picture information, and an electronic device.
Background
In internet applications, picture search is implemented in two ways: searching for pictures with a picture, and searching for pictures with query words. The second way retrieves a picture by matching the query words against the picture description information provided by the web page where the picture is located.
At present, a massive number of picture-and-text pages are added to the internet every day. Their quality is uneven, and pages whose pictures and text are inconsistent are common. In addition, as pictures spread and are forwarded, the corresponding description information is gradually distorted by editing and forwarding, which also makes pictures and text inconsistent. Such pages harm picture search ranking and greatly reduce its accuracy.
Disclosure of Invention
The embodiments of the invention provide a method and a device for processing picture information, and an electronic device, which solve the technical problem in the prior art that pages with inconsistent pictures and text reduce the accuracy of picture search ranking, and thereby improve that accuracy.
An embodiment of the invention provides a method for processing picture information, comprising:
performing duplicate-picture clustering on pictures in web pages to obtain a cluster picture of each cluster and a set of marked text fields of the cluster picture;
for each cluster picture, obtaining, according to the set of marked text fields, the keywords contained in each marked text field in the set and the word weights of the keywords, wherein a word weight reflects the relevance of the keyword to the cluster picture;
obtaining target keywords of each cluster picture according to the word weights of all keywords corresponding to that cluster picture;
and ranking picture search results according to the target keywords of each cluster picture and the word weights of the target keywords.
Optionally, obtaining, according to the set of marked text fields, the keywords contained in each marked text field in the set and the word weights of the keywords comprises:
obtaining the keywords in each marked text field;
obtaining the following target parameters for each keyword: the word frequency of the keyword in the marked text field it belongs to, the number of times the keyword occurs in that field, and the number of website domain names corresponding to the keyword;
and calculating the word weight of each keyword according to the target parameters of that keyword.
Optionally, calculating the word weight of each keyword according to the target parameters of that keyword comprises:
for each keyword, calculating the importance degree of the keyword in each marked text field it belongs to according to its word frequency and number of occurrences in that field, wherein the importance degree is a decaying accumulation over the word frequency and the number of occurrences;
and calculating the word weight of each keyword according to the importance degrees of the keyword in all marked text fields it belongs to and the number of website domain names corresponding to the keyword.
Optionally, ranking the picture search results according to the target keywords of each cluster picture and the word weights of the target keywords comprises:
matching the search words used in the picture search against the target keywords of each cluster picture to obtain matching keywords;
calculating a matching score between the target keywords of each cluster picture and the search words according to the word weight of each matching keyword as a search word and its word weight as a target keyword;
and ranking the picture search results according to the matching scores.
Optionally, calculating the matching score between the target keywords of each cluster picture and the search words according to the word weight of each matching keyword as a search word and its word weight as a target keyword comprises:
for each cluster picture, calculating the word weight of the matching keywords according to the word weight of each matching keyword as a target keyword and its word weight as a search word, and calculating a matching weight from the union of the search words and the target keywords together with the matching keywords;
and calculating the matching score between the target keywords of each cluster picture and the search words according to the word weight of the matching keywords and the matching weight.
Optionally, performing duplicate-picture clustering on pictures in web pages to obtain a cluster picture of each cluster and a set of marked text fields of the cluster picture comprises:
performing duplicate-picture clustering on pictures in web pages to obtain the cluster picture of each cluster;
extracting the marked text fields of the cluster picture from each web page in which the cluster picture is located;
and removing junk text from the marked text fields, and taking all marked text fields that remain after junk text removal as the set of marked text fields.
Optionally, removing junk text from the marked text fields comprises:
searching the text content of the marked text fields using a preset matching pattern, and removing marked text fields whose text content is junk text; and/or
removing marked text fields obtained from junk web pages according to the web page type corresponding to each marked text field; and/or
removing marked text fields whose release time is earlier than a set time according to the release time of the web page corresponding to each marked text field.
An embodiment of the invention further provides a device for processing picture information, comprising:
a clustering unit, configured to perform duplicate-picture clustering on pictures in web pages to obtain a cluster picture of each cluster and a set of marked text fields of the cluster picture;
a word weight calculation unit, configured to obtain, for each cluster picture and according to the set of marked text fields, the keywords contained in each marked text field in the set and the word weights of the keywords, wherein a word weight reflects the relevance of the keyword to the cluster picture;
a keyword extraction unit, configured to obtain target keywords of each cluster picture according to the word weights of all keywords corresponding to that cluster picture;
and a ranking unit, configured to rank picture search results according to the target keywords of each cluster picture and the word weights of the target keywords.
Optionally, the word weight calculation unit is configured to:
obtain the keywords in each marked text field;
obtain the following target parameters for each keyword: the word frequency of the keyword in the marked text field it belongs to, the number of times the keyword occurs in that field, and the number of website domain names corresponding to the keyword;
and calculate the word weight of each keyword according to the target parameters of that keyword.
Optionally, the word weight calculation unit is further configured to:
for each keyword, calculate the importance degree of the keyword in each marked text field it belongs to according to its word frequency and number of occurrences in that field, wherein the importance degree is a decaying accumulation over the word frequency and the number of occurrences;
and calculate the word weight of each keyword according to the importance degrees of the keyword in all marked text fields it belongs to and the number of website domain names corresponding to the keyword.
Optionally, the ranking unit is configured to:
match the search words used in the picture search against the target keywords of each cluster picture to obtain matching keywords;
calculate a matching score between the target keywords of each cluster picture and the search words according to the word weight of each matching keyword as a search word and its word weight as a target keyword;
and rank the picture search results according to the matching scores.
Optionally, the ranking unit is further configured to:
for each cluster picture, calculate the word weight of the matching keywords according to the word weight of each matching keyword as a target keyword and its word weight as a search word, and calculate a matching weight from the union of the search words and the target keywords together with the matching keywords;
and calculate the matching score between the target keywords of each cluster picture and the search words according to the word weight of the matching keywords and the matching weight.
Optionally, the clustering unit is configured to:
perform duplicate-picture clustering on pictures in web pages to obtain the cluster picture of each cluster;
extract the marked text fields of the cluster picture from each web page in which the cluster picture is located;
and remove junk text from the marked text fields, and take all marked text fields that remain after junk text removal as the set of marked text fields.
Optionally, the clustering unit is further configured to:
search the text content of the marked text fields using a preset matching pattern, and remove marked text fields whose text content is junk text; and/or
remove marked text fields obtained from junk web pages according to the web page type corresponding to each marked text field; and/or
remove marked text fields whose release time is earlier than a set time according to the release time of the web page corresponding to each marked text field.
The above technical solutions in the embodiments of the present application at least have the following technical effects:
An embodiment of the application provides a method for processing picture information: duplicate-picture clustering is performed on pictures in web pages to obtain the cluster picture of each cluster and its set of marked text fields; for each cluster picture, the keywords contained in each marked text field in the set and their word weights are obtained, with the word weight reflecting the relevance of the keyword to the cluster picture; the target keywords of each cluster picture and their word weights are obtained according to the word weights of the keywords; and the picture search results are ranked according to the target keywords of each cluster picture and their word weights. In this technical solution, duplicate-picture clustering yields a large number of marked text fields, and the target keywords of each picture and their weights are selected from those fields, which improves the accuracy of the picture description information. Ranking the picture search results on that basis solves the technical problem in the prior art that pages with inconsistent pictures and text reduce the accuracy of picture search ranking, and so improves that accuracy.
Drawings
FIG. 1 is a schematic flow chart of a search data processing method according to an embodiment of the present application;
FIG. 2 is a block diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solution provided by the embodiments of the application is a method for processing picture information in which a large number of marked text fields are obtained through duplicate-picture clustering, and high-quality keywords and their word weights are selected from those fields for use in picture search ranking. This solves the technical problem in the prior art that pages with inconsistent pictures and text reduce the accuracy of picture search ranking.
The main implementation principle, the specific implementation manner and the corresponding beneficial effects of the technical scheme of the embodiment of the application are described in detail below with reference to the accompanying drawings.
Examples
Referring to FIG. 1, an embodiment of the present application provides a method for processing picture information, including the following steps S11 to S14:
S11, performing duplicate-picture clustering on pictures in web pages to obtain a cluster picture of each cluster and a set of marked text fields of the cluster picture.
Duplicate-picture clustering groups identical pictures together. For example, in picture search the total number of original pictures before clustering is about 6 billion, and clustering yields 800 million clusters; each cluster represents a distinct picture, so there are currently 800 million distinct pictures. The picture corresponding to each cluster is called a cluster picture. Each cluster picture may appear in a large number of different web pages, and its marked text fields, such as the title of the picture, the text surrounding the picture and the descriptive text of the picture, can be obtained from each web page where the cluster picture is located. Generally, the descriptive text of a picture is the line of text immediately below and adjacent to the picture, and the surrounding text is the text within a certain distance of the picture. All these marked text fields together form the set of marked text fields of the cluster picture.
S12, for each cluster picture, obtaining, according to the set of marked text fields, the keywords contained in each marked text field in the set and the word weights of the keywords.
The word weight of a keyword reflects the relevance of the keyword to the cluster picture. The larger the word weight, the more relevant the keyword is to the cluster picture and the more important it is to that picture; conversely, the smaller the word weight, the less relevant and the less important the keyword is. Obtaining the word weights of the keywords therefore makes clear how important each keyword is to the cluster picture.
S13, obtaining target keywords of each cluster picture according to the word weights of all keywords corresponding to that cluster picture.
The set of marked text fields corresponding to each cluster picture contains a large number of keywords, but not all of them are consistent with the cluster picture. In this embodiment, the keywords are screened by word weight to obtain high-quality target keywords. The cluster picture is then labeled with the target keywords and their word weights, which serve as its text description information.
S14, ranking the picture search results according to the target keywords of each cluster picture and the word weights of the target keywords.
The target keywords of each cluster picture and their word weights serve as feature parameters in ranking the picture search results.
In the above embodiment, duplicate-picture clustering yields a large number of marked text fields from different web pages, and the word weights of the picture keywords are obtained from those fields. The cluster pictures are then labeled according to the word weights: high-quality target keywords and their word weights are selected for each cluster picture as its picture description information. The resulting description information is more accurate, and ranking the picture search results with it improves the accuracy of the ranking.
In a specific implementation, when S11 performs duplicate-picture clustering on pictures in web pages, the pictures in web pages on the internet can first be crawled. Each picture is then stored using a split-ring algorithm (hundreds of millions of pictures cannot be held on one machine, so multiple machines are used to store them). Next, a neighbor search is run for each picture against all pictures on the machines at the same time, to find the pictures that are identical to it. The neighbor search results are then merged using a union-find (disjoint-set) structure, so that all pictures identical to a given picture are associated and assigned a dedicated cluster id. This completes the duplicate-picture clustering, and the picture corresponding to a cluster id is the cluster picture of that cluster.
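The merging step described above can be sketched with a union-find (disjoint-set) structure. This is a minimal illustration, not the patented implementation: the input format (a list of picture ids plus pairs judged identical by the neighbor search), the function name `cluster_duplicates`, and the use of the set representative as the cluster id are all assumptions made for the example.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find keyed by picture id, with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen ids as their own representative.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_duplicates(picture_ids, duplicate_pairs):
    """Merge neighbor-search hits into duplicate clusters.

    duplicate_pairs: iterable of (id_a, id_b) pairs judged identical
    (hypothetical input format, assumed for this sketch).
    Returns {cluster_id: sorted list of picture ids}; the cluster id is
    simply the representative chosen by the disjoint set.
    """
    ds = DisjointSet()
    for pid in picture_ids:
        ds.find(pid)
    for a, b in duplicate_pairs:
        ds.union(a, b)
    clusters = defaultdict(list)
    for pid in picture_ids:
        clusters[ds.find(pid)].append(pid)
    return {root: sorted(members) for root, members in clusters.items()}
```

Because the pairs can arrive in any order and from any machine, a disjoint set is a natural fit here: each pair is merged independently and the final grouping does not depend on arrival order.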
After the duplicate-picture clustering is completed and the cluster pictures are obtained, S11 further collects the marked text fields of each cluster picture from every web page where that picture is located, that is, it crawls the marked text fields of the picture from the many web pages that contain its duplicates. To improve the quality of the marked text fields, junk text is removed from them, and all marked text fields that remain after junk text removal are taken as the set of marked text fields of the cluster picture.
Specifically, junk text can be removed in at least one of the following ways:
1. Search the text content of the marked text fields using a preset matching pattern, and remove marked text fields whose text content is junk text. The preset pattern matches single characters separated by spaces, which can mine characteristic junk text of types such as gambling and parasite pages. For example, this pattern can mine junk text such as "static parasite ranking", "online gambling field uniquely specified by authorities" and "online dealing of true-man and beauty lotus officers".
2. Remove marked text fields obtained from junk web pages according to the web page type corresponding to each marked text field. Web pages on the internet can generally be classified into junk web pages and normal web pages, and marked text fields from junk web pages can be filtered out.
3. Remove marked text fields whose release time is earlier than a set time according to the release time of the web page corresponding to each marked text field. The set time can be chosen according to the amount of marked-text-field data required: the more data required, the longer the set time, and vice versa. For example, if the set time is 1 year and a marked text field comes from web page A, which was released 3 years ago, that marked text field is removed.
4. Remove marked text fields whose size is smaller than a set size according to the size of the marked text field. For example, remove marked text fields whose length and width are smaller than 180 mm.
5. Compare text contents and remove marked text fields whose text content is repeated.
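Several of the removal rules above can be sketched as a single filtering pass. This is a hedged illustration only: the regular expression for "single characters separated by spaces", the dictionary keys `text`, `is_spam_page` and `published`, and the function name `remove_junk_fields` are assumptions, since the source does not give the exact pattern or data layout. Rule 4 (size) is left out because the source does not specify how size is measured.

```python
import re
from datetime import datetime, timedelta

# Three or more single characters, each separated by whitespace, e.g. "b e t s".
# Assumed pattern; the source only says "single words and spaces".
SPACED_CHARS = re.compile(r"(?<!\S)(?:\S\s+){2,}\S(?!\S)")


def remove_junk_fields(fields, now, max_age=timedelta(days=365)):
    """Keep fields that pass the pattern, page-type, age and duplicate checks.

    Each field is a dict with hypothetical keys:
    "text" (str), "is_spam_page" (bool) and "published" (datetime).
    """
    kept, seen = [], set()
    for field in fields:
        text = field["text"].strip()
        if SPACED_CHARS.search(text):            # rule 1: spaced single characters
            continue
        if field["is_spam_page"]:                # rule 2: field came from a junk page
            continue
        if now - field["published"] > max_age:   # rule 3: page released too long ago
            continue
        if text in seen:                         # rule 5: repeated text content
            continue
        seen.add(text)
        kept.append(field)
    return kept
```

The `max_age` default of one year mirrors the example in rule 3; in practice it would be tuned to the amount of data required.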
Using the set of marked text fields obtained after junk text removal, S12 is executed to calculate the word weights of the keywords contained in each marked text field in the set. The word weight represents the importance of a keyword to the cluster picture, and that importance is directly proportional to the word weight. Specifically, for the set of marked text fields of each cluster picture, the keywords in each marked text field in the set are obtained; the following target parameters are obtained for each keyword: the word frequency of the keyword in the marked text field it belongs to, the number of times the keyword occurs in that field, and the number of website domain names corresponding to the keyword; and the word weight of each keyword is calculated according to its target parameters.
For each keyword, the importance degree of the keyword in each marked text field it belongs to is calculated from its word frequency and number of occurrences in that field. The importance degree is a decaying accumulation over the word frequency and the number of occurrences, and can be calculated by Formula II. The word weight of each keyword is then calculated from the importance degrees of the keyword in all marked text fields it belongs to and the number of website domain names corresponding to the keyword, which can be done by Formula I.
In these formulas, w represents the word weight; i denotes the i-th marked text field in the set; n represents the number of marked text fields in the set; f(i) represents the importance degree of a keyword in the i-th marked text field; n_page_domain_uniq represents the number of website domain names corresponding to the cluster picture; max_weight represents the preset maximum weight; norm_k represents the normalization parameter; m represents the word frequency; and n_domain represents the number of times the keyword appears in the marked text field it belongs to. The maximum word weight may be 255, indicating that the keyword appears frequently across the marked text fields, and the minimum may be 1, indicating that the keyword rarely appears in the cluster. The value of norm_k may be 4.
Based on the word weights calculated above, S13 is performed to screen the keywords. Specifically, all keywords of each cluster picture can be sorted by word weight, and the n keywords with the largest word weights are taken as the target keywords, where n ≥ 1. For example, suppose a picture of "plot happy" is repeated 76 times in total, 50 usable marked text fields are obtained from the web pages in which it appears, and the word weights of the keywords in those fields are calculated as [plot: 49; the plot of the lover is happy: 26; three-dimensional: 24; lovers: 10; materials: 10; Suzhou: 10]. The target keywords extracted for the picture according to these word weights are then [plot: 49; the plot of the lover is happy: 26; three-dimensional: 24], which eliminates keywords that are inconsistent with or irrelevant to the picture.
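The screening step in S13 amounts to a top-n selection by word weight, which might look like the following sketch. The function name `select_target_keywords` and the placeholder keyword labels in the usage note are illustrative assumptions; only the weights are taken from the example above.

```python
def select_target_keywords(word_weights, n):
    """Return the n keywords with the largest word weight, heaviest first.

    word_weights: dict mapping keyword -> word weight.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

With the six weights from the example (49, 26, 24, 10, 10, 10) and n = 3, the call keeps the three keywords weighted 49, 26 and 24 and drops the three tied at 10.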
After the target keywords of each cluster picture are obtained, S14 is executed to rank the picture search results. Specifically, the target keywords of each cluster picture are matched against the search words used in the picture search, and the matching score between the target keywords of the cluster picture and the search words is calculated from the matching keywords and their word weights; the picture search results are then ranked according to the matching score of each cluster picture. Matching the target keywords of a cluster picture against the search words amounts to matching the description text of the cluster picture against the search words. The matching score cluster_tf_bmrank between the target keywords and the search words can be obtained as the product of the text score doc_score of the description text of the cluster picture and the text matching score jaccard_match_weight between the description text and the search words, as shown in Formula III:
$$\mathrm{cluster\_tf\_bmrank} = \mathrm{doc\_score} \times \mathrm{jaccard\_match\_weight} \quad \text{(Formula III)}$$
The text score doc_score is the word weight of the matching keywords, that is, of those target keywords of the cluster picture that match the search words. A matching keyword is a keyword that appears in both the target keywords and the search words, so its word weight can be calculated from its word weight as a target keyword and its word weight as a search word. Specifically, doc_score can be calculated by Formula IV:
$$\mathrm{doc\_score} = \sum \mathrm{match\_term\_freq} \times \mathrm{query\_weight} \quad \text{(Formula IV)}$$
where match_term_freq represents the word weight of a matching keyword as a target keyword, query_weight represents its word weight as a search word, and the sum runs over the matching keywords.
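The text score and the final matching score can be sketched as follows. This is a minimal illustration assuming that both the target keywords and the search words are given as dictionaries from term to word weight, and that the matching weight is supplied by the caller; the function and parameter names are hypothetical.

```python
def doc_score(target_weights, query_weights):
    """Text score: sum of match_term_freq * query_weight over matching keywords.

    target_weights: keyword -> word weight as a target keyword (match_term_freq).
    query_weights:  search word -> word weight as a search word (query_weight).
    """
    # Matching keywords are those present in both dictionaries.
    matching = target_weights.keys() & query_weights.keys()
    return sum(target_weights[k] * query_weights[k] for k in matching)


def cluster_tf_bmrank(target_weights, query_weights, jaccard_match_weight):
    """Matching score: doc_score multiplied by the supplied matching weight."""
    return doc_score(target_weights, query_weights) * jaccard_match_weight
```

For instance, with target weights `{"rose": 49, "card": 26, "3d": 24}` and query weights `{"rose": 2, "card": 1, "cheap": 1}`, the matching keywords are "rose" and "card", so the text score is 49 × 2 + 26 × 1 = 124.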
The jaccard_match_weight is a text matching score, and represents the matching degree between all target keywords of the cluster-like picture and the search word, which is also called as a matching weight, and can be obtained according to the union of the search word and the target keywords and the matching keyword through calculation, as shown in a formula five:
where query_term_freq_cluster_term_freq represents the sum of the word weights over the union of the search word and the target keywords, that is, over the total number of keywords in the union.
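The way the two factors of formula three combine can be sketched as follows. Because formula five is not reproduced in this text, the sketch substitutes the plain, unweighted Jaccard index (number of matching keywords divided by the number of keywords in the union) for jaccard_match_weight; this substitution, the function names, and the sample values are all assumptions made for illustration.

```python
def doc_score(target_weights, query_weights):
    """Formula four read as a sum of products over the matching keywords."""
    matching = target_weights.keys() & query_weights.keys()
    return sum(target_weights[k] * query_weights[k] for k in matching)


def jaccard_match_weight(target_weights, query_weights):
    """Assumed stand-in for formula five: the plain Jaccard index of the two keyword sets."""
    union = target_weights.keys() | query_weights.keys()
    if not union:
        return 0.0
    matching = target_weights.keys() & query_weights.keys()
    return len(matching) / len(union)


def cluster_tf_bmrank(target_weights, query_weights):
    """Formula three: product of the text score and the matching weight."""
    return doc_score(target_weights, query_weights) * jaccard_match_weight(
        target_weights, query_weights
    )


cluster_keywords = {"panda": 0.5, "zoo": 0.25, "bamboo": 0.125}
query_terms = {"panda": 1.0, "bamboo": 2.0, "cub": 1.0}
# 2 matching keywords out of 4 in the union gives 0.5; the text score is 0.75.
print(cluster_tf_bmrank(cluster_keywords, query_terms))  # 0.375
```

The multiplicative form means a cluster scores highly only when its matched keywords carry weight and the overlap with the search word is large relative to the union.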
Based on the repeated picture clustering, a large number of repeated docs can be obtained. First, junk docs are filtered out using information such as time, site, size and text. Word weights are then calculated from the related texts, sites and word distribution. Finally, a feature (denoted cluster_tf_bmrank) is generated from how the word weights match the picture search query; this feature participates in training the ranking model and thereby improves the ranking of picture search results.
The matching score cluster_tf_bmrank calculated for each cluster-like picture can be used as a one-dimensional feature in training the ranking model for picture search results, thereby improving the ranking of the picture search results. Alternatively, existing picture search results can be reordered according to the matching score cluster_tf_bmrank, so that the optimized ranking reflects the degree of matching between the pictures and the search intention and the ranking accuracy is improved.
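The reordering alternative described above can be sketched as a stable sort on the matching score. The result records, the `cluster_id` field, and the default score of 0.0 for pictures without a cluster score are hypothetical choices for this example.

```python
def rerank(results, cluster_scores):
    """Reorder search results by the cluster_tf_bmrank of each result's cluster.

    results:        list of dicts, each with a hypothetical "cluster_id" key
    cluster_scores: cluster_id -> matching score (cluster_tf_bmrank)
    """
    def score(result):
        # Results whose cluster has no score fall back to 0.0 (an assumption).
        return cluster_scores.get(result["cluster_id"], 0.0)

    # sorted() is stable, so results with equal scores keep their existing order.
    return sorted(results, key=score, reverse=True)


results = [
    {"url": "a.example/1.jpg", "cluster_id": "c1"},
    {"url": "b.example/2.jpg", "cluster_id": "c2"},
    {"url": "c.example/3.jpg", "cluster_id": "c3"},
]
scores = {"c1": 0.1, "c2": 0.375}
print([r["cluster_id"] for r in rerank(results, scores)])  # ['c2', 'c1', 'c3']
```

When the score is instead used as a training feature, the same per-cluster value would be appended to each result's feature vector rather than used directly as the sort key.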
Based on the same inventive concept, an embodiment of the present application further provides a device for processing picture information, corresponding to the method for processing picture information provided in the foregoing embodiment. Referring to fig. 2, the device includes:
a clustering unit 21, configured to perform repeated picture clustering on pictures in web pages to obtain a cluster-like picture of each cluster and a marked text field set of the cluster-like picture;
a word weight calculating unit 22, configured to obtain, for each cluster-like picture and according to its marked text field set, the keywords contained in each marked text field in the set and the word weights of the keywords, where a word weight reflects the relevance between the keyword and the cluster-like picture;
a keyword extraction unit 23, configured to obtain the target keywords of each cluster-like picture according to the word weights of all keywords corresponding to that cluster-like picture;
and a sorting unit 24, configured to sort the picture search results according to the target keywords of each cluster-like picture and the word weights of the target keywords.
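The selection performed by the keyword extraction unit 23 can be sketched as a top-n selection by word weight, which is how claim 1 describes it. The function name and the sample weights are assumptions for illustration.

```python
import heapq


def top_n_keywords(word_weights, n):
    """Return the n keywords with the largest word weights as (keyword, weight) pairs."""
    return heapq.nlargest(n, word_weights.items(), key=lambda item: item[1])


weights = {"panda": 0.9, "zoo": 0.5, "photo": 0.1, "bamboo": 0.7}
print(top_n_keywords(weights, 2))  # [('panda', 0.9), ('bamboo', 0.7)]
```

Sorting the full keyword list and slicing the first n entries gives the same result; `heapq.nlargest` simply avoids a full sort when n is small.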
As an alternative embodiment, the word weight calculating unit 22 is configured to: acquire the keywords in each marked text field; obtain the following target parameters for each keyword: the word frequency and the occurrence frequency of the keyword in each marked text field to which it belongs, and the number of website domain names corresponding to the keyword; and calculate the word weight of each keyword according to its target parameters. Optionally, the word weight calculating unit 22 is further configured to: for each keyword, calculate the importance degree of the keyword in each marked text field to which it belongs according to the word frequency and the occurrence frequency of the keyword in that field, where the importance degree is accumulated with decay according to the word frequency and the occurrence frequency; and calculate the word weight of each keyword according to its importance degrees in all the marked text fields to which it belongs and the number of website domain names corresponding to the keyword.
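The word weight calculation can be sketched as follows. The concrete formulas are not restated in this passage, so the geometric decay per occurrence, the `decay` parameter, and the logarithmic domain factor below are assumptions chosen only to show a decayed accumulation combined with a domain count; they are not the formulas of the application.

```python
import math


def field_importance(term_freq, occurrences, decay=0.5):
    """Decayed accumulation: the k-th occurrence contributes term_freq * decay**k."""
    return sum(term_freq * decay ** k for k in range(occurrences))


def word_weight(field_stats, domain_count, decay=0.5):
    """Combine per-field importance with the number of distinct website domains.

    field_stats:  list of (term_freq, occurrences) pairs, one per marked text field
    domain_count: number of website domain names on which the keyword appears
    """
    total_importance = sum(field_importance(tf, n, decay) for tf, n in field_stats)
    # log1p damps the influence of very large domain counts (an assumed choice).
    return total_importance * math.log1p(domain_count)


stats = [(1.0, 3), (0.5, 1)]
print(field_importance(1.0, 3))  # 1.0 + 0.5 + 0.25 = 1.75
print(word_weight(stats, domain_count=4))
```

With decay, repeated occurrences of a keyword within one field add progressively less, so a keyword spread across many sites outweighs one repeated many times on a single page.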
As an alternative embodiment, the sorting unit 24 is configured to: match the search word used in the picture search against the target keywords of each cluster-like picture to obtain matching keywords; calculate a matching score between the target keywords of each cluster-like picture and the search word according to the word weight of each matching keyword as a search word and its word weight as a target keyword; and sort the picture search results according to the matching scores. Optionally, the sorting unit 24 is further configured to: for each cluster-like picture, calculate the word weight of the matching keywords according to the word weight of each matching keyword as a target keyword and its word weight as a search word; calculate a matching weight according to the matching keywords and the union of the search word and the target keywords; and calculate the matching score between the target keywords of each cluster-like picture and the search word according to the word weight of the matching keywords and the matching weight.
As an alternative embodiment, the clustering unit 21 is configured to: perform repeated picture clustering on pictures in web pages to obtain the cluster-like picture of each cluster; extract a marked text field of the cluster-like picture from each web page in which the cluster-like picture is located; and remove junk text from the marked text fields, taking all marked text fields remaining after the removal as the marked text field set. Optionally, the clustering unit 21 is further configured to: search the text content of the marked text fields using a preset matching pattern and remove any marked text field whose text content is junk text; and/or remove any marked text field obtained from a junk web page according to the web page type corresponding to the marked text field; and/or remove any marked text field whose release time is earlier than a set time according to the release time of the web page corresponding to the marked text field.
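The three junk-removal criteria can be sketched as a single filter. The `TextField` record, the regular expression, the junk page types, and the cutoff date are all hypothetical; the passage above allows the criteria to be used individually or in combination, whereas this sketch applies all three.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class TextField:
    text: str          # text content of the marked text field
    page_type: str     # type of the web page the field was extracted from
    published: date    # release time of that web page


# Hypothetical preset matching pattern for junk text (ads, empty fields).
JUNK_PATTERN = re.compile(r"click here|advertisement|^\s*$", re.IGNORECASE)
# Hypothetical page types treated as junk web pages.
JUNK_PAGE_TYPES = {"spam", "link_farm"}


def filter_fields(fields, cutoff):
    """Keep fields that pass the text, page-type and release-time checks."""
    return [
        f for f in fields
        if not JUNK_PATTERN.search(f.text)
        and f.page_type not in JUNK_PAGE_TYPES
        and f.published >= cutoff
    ]


fields = [
    TextField("giant panda eating bamboo", "news", date(2019, 6, 1)),
    TextField("Click here to win", "news", date(2019, 6, 1)),
    TextField("panda wallpaper", "spam", date(2019, 6, 1)),
    TextField("panda at the zoo", "blog", date(2009, 1, 1)),
]
kept = filter_fields(fields, cutoff=date(2015, 1, 1))
print([f.text for f in kept])  # ['giant panda eating bamboo']
```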
The specific manner in which the units of the device in the above embodiment perform their operations has been described in detail in the method embodiments and is not repeated here.
Fig. 3 is a block diagram of an electronic device 800 for implementing a processing method of picture information, according to an example embodiment. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 3, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type of volatile or nonvolatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect an on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor component 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 804 including instructions executable by the processor 820 of the electronic device 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
A non-transitory computer readable storage medium stores instructions which, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a method of processing picture information, the method comprising: performing repeated picture clustering on pictures in web pages to obtain a cluster-like picture of each cluster and a marked text field set of the cluster-like picture; for each cluster-like picture, acquiring, according to the marked text field set, the keywords contained in each marked text field in the set and the word weights of the keywords, wherein a word weight reflects the relevance between the keyword and the cluster-like picture; acquiring the target keywords of each cluster-like picture according to the word weights of all keywords corresponding to that cluster-like picture; and sorting the picture search results according to the target keywords of each cluster-like picture and the word weights of the target keywords.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A method for processing picture information, the method comprising:
performing repeated picture clustering on pictures in web pages to obtain a cluster-like picture of each cluster and a marked text field set of the cluster-like picture;
for each cluster-like picture, acquiring, according to the marked text field set, keywords contained in each marked text field in the marked text field set and word weights of the keywords, wherein the word weights are used for reflecting the relevance between the keywords and the cluster-like picture;
sorting all keywords according to the word weights of all keywords corresponding to each class cluster picture, and acquiring the first n keywords with the largest word weights as target keywords, so as to acquire the target keywords of each class cluster picture;
Sorting the picture search results according to the target keywords of each cluster-like picture and the word weights of the target keywords;
the step of obtaining the keyword and the word weight of the keyword contained in each tagged text field in the tagged text field set according to the tagged text field set includes:
acquiring keywords in each marked text field;
The following target parameters are obtained for each keyword: the word frequency and the occurrence frequency of the keywords in the belonging marked text field and the number of website domain names corresponding to the keywords;
And calculating and obtaining the word weight of each keyword according to the target parameter of each keyword.
2. The method of claim 1, wherein computing a word weight for each of the keywords based on the target parameters for each of the keywords comprises:
For each keyword, calculating and obtaining the importance degree of the keyword in each affiliated marked text field according to the word frequency and the occurrence frequency of the keyword in each affiliated marked text field, wherein the importance degree is accumulated in a decaying way according to the word frequency and the occurrence frequency;
and calculating to obtain the word weight of each keyword according to the importance degree of the keywords in all the affiliated marked text fields and the number of website domain names corresponding to the keywords.
3. The method of claim 1, wherein the ranking the picture search results according to the target keyword and the word weight of the target keyword for each cluster-like picture comprises:
Matching search words adopted in picture searching with the target keywords of each class of cluster pictures to obtain matching keywords;
Calculating a matching score between the target keyword and the search word of each cluster picture according to the word weight of the matching keyword as the search word and the word weight of the matching keyword as the target keyword;
and sorting the picture search results according to each matching score.
4. The method of claim 3, wherein the calculating a matching score between the target keyword and the search word for each cluster-like picture based on the word weight of the matching keyword as the search word and the word weight of the matching keyword as the target keyword comprises:
Aiming at each class cluster picture, calculating to obtain the word weight of the matched keyword according to the word weight of the matched keyword serving as a target keyword and the word weight of the matched keyword serving as a search word; according to the union of the search word and the target keyword and the matching keyword, calculating to obtain matching weight;
and calculating and obtaining a matching score between the target keyword and the search word of each cluster-like picture according to the word weight of the matching keyword and the matching weight.
5. The method of any one of claims 1 to 4, wherein the performing repeated picture clustering on the pictures in the web page to obtain a cluster-like picture of each cluster and a labeled text field set of the cluster-like picture includes:
repeating picture clustering is carried out on pictures in the webpage, and class cluster pictures of each class cluster are obtained;
extracting a marked text field of the cluster-like picture from each webpage in which the cluster-like picture is positioned;
and removing junk texts in the marked text fields, and taking all marked text fields from which the junk texts are removed as the marked text field set.
6. The method of claim 5, wherein the removing the junk text in the tagged text field comprises:
searching the text content of the marked text fields using a preset matching pattern, and removing any marked text field whose text content is junk text; and/or
removing any marked text field obtained from a junk web page according to the web page type corresponding to the marked text field; and/or
removing any marked text field whose release time is earlier than a set time according to the release time of the web page corresponding to the marked text field.
7. A picture information processing apparatus, characterized in that the apparatus comprises:
the clustering unit is used for carrying out repeated picture clustering on pictures in the webpage to obtain a class cluster picture of each class cluster and a marked text field set of the class cluster picture;
the word weight calculation unit is used for obtaining, for each cluster-like picture and according to the marked text field set, keywords contained in each marked text field in the marked text field set and word weights of the keywords, wherein the word weights are used for reflecting the relevance between the keywords and the cluster-like picture;
the word weight calculation unit is used for: acquiring keywords in each marked text field; the following target parameters are obtained for each keyword: the word frequency and the occurrence frequency of the keywords in the belonging marked text field and the number of website domain names corresponding to the keywords; according to the target parameters of each keyword, calculating to obtain the word weight of each keyword;
The keyword extraction unit is used for sequencing all keywords according to the word weights of all keywords corresponding to each cluster-like picture, and acquiring the first n keywords with the largest word weights as target keywords so as to acquire the target keywords of each cluster-like picture;
And the ordering unit is used for ordering the picture search results according to the target keywords of each class of cluster pictures and the word weights of the target keywords.
8. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the method of any of claims 1-6.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, carries out the steps corresponding to the method according to any one of claims 1-6.
CN202010366994.4A 2020-04-30 2020-04-30 Picture information processing method and device and electronic equipment Active CN113590861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010366994.4A CN113590861B (en) 2020-04-30 2020-04-30 Picture information processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113590861A CN113590861A (en) 2021-11-02
CN113590861B true CN113590861B (en) 2024-06-18

Family

ID=78237596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010366994.4A Active CN113590861B (en) 2020-04-30 2020-04-30 Picture information processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113590861B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114710554B (en) * 2022-03-30 2024-04-26 北京奇艺世纪科技有限公司 Message processing method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033385A (en) * 2018-07-27 2018-12-18 百度在线网络技术(北京)有限公司 Picture retrieval method, device, server and storage medium
CN110019675A (en) * 2017-12-01 2019-07-16 北京搜狗科技发展有限公司 A kind of method and device of keyword extraction

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164436B (en) * 2011-12-13 2017-06-16 阿里巴巴集团控股有限公司 A kind of image search method and device
CN102609458B (en) * 2012-01-12 2015-08-05 北京搜狗信息服务有限公司 A kind of picture recommendation method and device
CN103544186B (en) * 2012-07-16 2017-03-01 富士通株式会社 The method and apparatus excavating the subject key words in picture
CN103995848B (en) * 2014-05-06 2017-04-05 百度在线网络技术(北京)有限公司 Image searching method and device
CN104504109B (en) * 2014-12-30 2017-03-15 百度在线网络技术(北京)有限公司 Image searching method and device
CN104881401B (en) * 2015-05-27 2017-10-17 大连理工大学 A kind of patent document clustering method
CN105354307B (en) * 2015-11-06 2021-01-15 腾讯科技(深圳)有限公司 Image content identification method and device
CN110956038B (en) * 2019-10-16 2022-07-05 厦门美柚股份有限公司 Method and device for repeatedly judging image-text content
CN110765301B (en) * 2019-11-06 2022-02-25 腾讯科技(深圳)有限公司 Picture processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113590861A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40055320; Country of ref document: HK)
GR01 Patent grant