CN113590861A - Picture information processing method and device and electronic equipment - Google Patents

Info

Publication number
CN113590861A
Authority
CN
China
Prior art keywords
cluster
picture
keywords
keyword
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010366994.4A
Other languages
Chinese (zh)
Other versions
CN113590861B (en
Inventor
潘达
董国盛
周泽南
苏雪峰
陈炜鹏
许静芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010366994.4A priority Critical patent/CN113590861B/en
Publication of CN113590861A publication Critical patent/CN113590861A/en
Application granted granted Critical
Publication of CN113590861B publication Critical patent/CN113590861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/535Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a picture information processing method, a picture information processing device, and electronic equipment. The method comprises: performing repeated picture clustering on pictures in web pages to obtain the cluster picture of each cluster and the marked text field set of each cluster picture; for each cluster picture, obtaining from the marked text field set the keywords contained in each marked text field and their word weights, where a word weight reflects the degree of correlation between a keyword and the cluster picture; obtaining the target keywords of each cluster picture according to the word weights of all keywords corresponding to that cluster picture; and ranking picture search results according to the target keywords of each cluster picture and the word weights of those target keywords. In this technical scheme, repeated picture clustering yields a large number of marked text fields, from which the target keywords of each picture and their word weights are selected and used to rank picture search results. This solves the prior-art technical problem that inconsistency between pictures and text reduces the accuracy of picture search ranking.

Description

Picture information processing method and device and electronic equipment
Technical Field
The present invention relates to the field of software technologies, and in particular, to a method and an apparatus for processing picture information, and an electronic device.
Background
In internet applications, pictures can be searched in two ways: by picture (using an image as the query) or by query words. The second approach usually retrieves pictures by matching the query words against the picture description information provided by the web page on which the picture appears.
At present, a large number of image-text pages are added to the internet every day. These pages vary in quality, and pages whose pictures and text are inconsistent are not uncommon. In addition, as pictures spread and are forwarded, their description information is gradually distorted by editing and reposting, so that picture and text no longer match. Pages with inconsistent pictures and text negatively affect picture search ranking and greatly reduce its accuracy.
Disclosure of Invention
The embodiment of the invention provides a picture information processing method and device and electronic equipment, which are used for solving the technical problem that in the prior art, the picture searching and sorting accuracy is reduced due to a page with inconsistent pictures and texts, and improving the picture searching and sorting accuracy.
The embodiment of the invention provides a method for processing picture information, which comprises the following steps:
performing repeated picture clustering on pictures in a webpage to obtain cluster pictures of each cluster and a marked text field set of the cluster pictures;
aiming at each cluster-like picture, acquiring keywords contained in each mark text field in the mark text field set and word weights of the keywords according to the mark text field set, wherein the word weights are used for reflecting the correlation degree of the keywords and the cluster-like pictures;
acquiring a target keyword of each class cluster picture according to the word weight of all keywords corresponding to each class cluster picture;
and sequencing the picture search results according to the target keywords of each class of cluster picture and the word weights of the target keywords.
Optionally, the obtaining, according to the tagged text field set, keywords included in each tagged text field in the tagged text field set and word weights of the keywords includes:
acquiring a keyword in each marked text field;
the following target parameters are obtained for each keyword: the word frequency and the occurrence frequency of the keywords in the mark text domain and the number of the website domain names corresponding to the keywords;
and calculating the word weight of each keyword according to the target parameter of each keyword.
Optionally, calculating a word weight of each keyword according to the target parameter of each keyword, including:
aiming at each keyword, calculating and obtaining the importance degree of the keyword in each affiliated label text domain according to the word frequency and the occurrence frequency of the keyword in each affiliated label text domain, wherein the importance degree is accumulated according to the attenuation of the word frequency and the occurrence frequency;
and calculating the word weight of each keyword according to the importance degree of the keyword in all the belonging marked text fields and the number of the website domain names corresponding to the keyword.
Optionally, the sorting the image search results according to the target keyword of each cluster image and the word weight of the target keyword includes:
matching search words adopted during image search with the target keywords of each cluster image to obtain matched keywords;
calculating a matching score between the target keyword and the search word of each class of cluster picture according to the word weight of the matching keyword as the search word and the word weight of the matching keyword as the target keyword;
and sequencing the picture search results according to each matching score.
Optionally, the calculating a matching score between the target keyword and the search term of each cluster image according to the word weight of the matching keyword as the search term and the word weight of the matching keyword as the target keyword includes:
calculating to obtain the word weight of the matched keywords according to the word weight of the matched keywords as target keywords and the word weight of the matched keywords as search words for each class of cluster pictures; calculating to obtain matching weight according to the union of the search word and the target keyword and the matching keyword;
and calculating and obtaining the matching score between the target keyword and the search word of each cluster image according to the word weight of the matched keyword and the matching weight.
Optionally, the repeating image clustering of the images in the web page to obtain the cluster image of each cluster and the tagged text field set of the cluster image includes:
performing repeated picture clustering on pictures in the webpage to obtain a cluster picture of each cluster;
extracting a mark text field of the cluster-like picture from each webpage where the cluster-like picture is located;
and removing the junk text in the marked text field, and taking all marked text fields with the junk text removed as the marked text field set.
Optionally, the removing the spam text in the markup text field includes:
searching the text content in the marked text field through a preset matching mode, and removing the marked text field of which the text content is a junk text; and/or
removing the marked text field obtained from the junk web page according to the web page type corresponding to the marked text field; and/or
and removing the marked text field with the release time being earlier than the set time according to the corresponding webpage release time of the marked text field.
An embodiment of the present invention further provides a device for processing picture information, where the device includes:
the clustering unit is used for carrying out repeated picture clustering on pictures in the webpage to obtain a cluster picture of each cluster and a mark text domain set of the cluster pictures;
the word weight calculation unit is used for acquiring keywords contained in each mark text domain in the mark text domain set and word weights of the keywords according to the mark text domain set aiming at each class cluster picture, wherein the word weights are used for reflecting the correlation degree of the keywords and the class cluster pictures;
the keyword extraction unit is used for acquiring target keywords of each class cluster picture according to the word weights of all the keywords corresponding to each class cluster picture;
and the sorting unit is used for sorting the image search results according to the target keywords of each cluster image and the word weights of the target keywords.
Optionally, the word weight calculating unit is configured to:
acquiring a keyword in each marked text field;
the following target parameters are obtained for each keyword: the word frequency and the occurrence frequency of the keywords in the mark text domain and the number of the website domain names corresponding to the keywords;
and calculating the word weight of each keyword according to the target parameter of each keyword.
Optionally, the word weight calculating unit is further configured to:
aiming at each keyword, calculating and obtaining the importance degree of the keyword in each affiliated label text domain according to the word frequency and the occurrence frequency of the keyword in each affiliated label text domain, wherein the importance degree is accumulated according to the attenuation of the word frequency and the occurrence frequency;
and calculating the word weight of each keyword according to the importance degree of the keyword in all the belonging marked text fields and the number of the website domain names corresponding to the keyword.
Optionally, the sorting unit is configured to:
matching search words adopted during image search with the target keywords of each cluster image to obtain matched keywords;
calculating a matching score between the target keyword and the search word of each class of cluster picture according to the word weight of the matching keyword as the search word and the word weight of the matching keyword as the target keyword;
and sequencing the picture search results according to each matching score.
Optionally, the sorting unit is further configured to:
calculating to obtain the word weight of the matched keywords according to the word weight of the matched keywords as target keywords and the word weight of the matched keywords as search words for each class of cluster pictures; calculating to obtain matching weight according to the union of the search word and the target keyword and the matching keyword;
and calculating and obtaining the matching score between the target keyword and the search word of each cluster image according to the word weight of the matched keyword and the matching weight.
Optionally, the clustering unit is configured to:
performing repeated picture clustering on pictures in the webpage to obtain a cluster picture of each cluster;
extracting a mark text field of the cluster-like picture from each webpage where the cluster-like picture is located;
and removing the junk text in the marked text field, and taking all marked text fields with the junk text removed as the marked text field set.
Optionally, the clustering unit is further configured to:
searching the text content in the marked text field through a preset matching mode, and removing the marked text field of which the text content is a junk text; and/or
removing the marked text field obtained from the junk web page according to the web page type corresponding to the marked text field; and/or
and removing the marked text field with the release time being earlier than the set time according to the corresponding webpage release time of the marked text field.
One or more technical solutions in the embodiments of the present application have at least the following technical effects:
the embodiment of the application provides a method for processing picture information, which comprises the steps of carrying out repeated picture clustering on pictures in a webpage to obtain cluster pictures of each cluster and a marked text domain set of the cluster pictures; aiming at each cluster image, acquiring keywords contained in each mark text field in the set and word weights of the keywords according to the mark text field set, and reflecting the correlation degree of the keywords and the cluster images through the word weights; acquiring a target keyword of each class cluster picture and the word weight of the target keyword according to the word weight of the keyword; and then sorting the picture search results according to the target keywords of each class of cluster picture and the word weights of the target keywords. According to the technical scheme, a large number of marked text fields are obtained by repeating picture clustering, the target keywords of the pictures and the weights of the target keywords are selected based on the large number of marked text fields, the accuracy of picture description information is improved, the picture search results are sequenced accordingly, the technical problem that the accuracy of picture search sequencing is reduced due to page image-text inconsistency in the prior art is solved, and the accuracy of picture search sequencing is improved.
Drawings
Fig. 1 is a schematic flowchart of a search data processing method according to an embodiment of the present application;
Fig. 2 is a block diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the technical scheme provided by the embodiment of the application, the picture information processing method is provided, a large number of marked text fields are obtained by repeating picture clustering, and high-quality keywords and word weights thereof are selected from the marked text fields to search and sort pictures, so that the technical problem that the accuracy of picture searching and sorting is reduced due to page image-text inconsistency in the prior art is solved.
The main implementation principle, the specific implementation mode and the corresponding beneficial effects of the technical scheme of the embodiment of the present application are explained in detail with reference to the accompanying drawings.
Examples
Referring to fig. 1, an embodiment of the present application provides a method for processing picture information, including the following steps S11 to S14:
S11: Perform repeated picture clustering on the pictures in web pages to obtain the cluster picture of each cluster and the marked text field set of each cluster picture.
For example, in a current picture search system, about 6 billion original pictures are collected before clustering, and about 800 million clusters are found after clustering. Each cluster represents one distinct picture, so there are currently about 800 million distinct pictures. The picture corresponding to each cluster is called its cluster picture. Each cluster picture may appear on a large number of different web pages, and from each such page a marked text field of the cluster picture can be obtained, such as the picture's title, its surrounding text, or its description text.
S12: For each cluster picture, obtain the keywords contained in each marked text field in the marked text field set, and the word weights of those keywords, according to the marked text field set.
The word weight of a keyword reflects the degree of correlation between the keyword and the cluster picture. The larger the word weight, the more relevant and the more important the keyword is to the cluster picture; the smaller the word weight, the less relevant and the less important it is.
S13: Obtain the target keywords of each cluster picture according to the word weights of all keywords corresponding to that cluster picture.
The marked text field set of each cluster picture contains a large number of keywords, but not all of them are consistent with the cluster picture. The keywords are therefore screened by word weight to obtain high-quality target keywords. The target keywords and their word weights then serve as the text description information of the cluster picture and are used to label it.
S14: Rank the picture search results according to the target keywords of each cluster picture and the word weights of the target keywords.
The target keywords of each cluster picture and their word weights are used as feature parameters in ranking the picture search results.
In this embodiment, repeated picture clustering yields a large number of marked text fields from different web pages. Word weights of the picture keywords are computed from these marked text fields, and high-quality target keywords and their word weights are selected for each cluster picture as its picture description information. This description information is more accurate than that of any single page, so ranking the picture search results by it improves ranking accuracy.
In a specific implementation, when S11 performs repeated picture clustering on the pictures in web pages, the pictures in web pages on the internet may first be crawled. Each picture is then stored using a ring division algorithm (billions of pictures cannot be held on one machine, so the pictures are distributed across multiple machines). Next, for each picture, a neighbor retrieval is run simultaneously against the pictures on all machines to find the pictures identical to it. The neighbor retrieval results are then merged using a union-find (disjoint-set) structure, so that all identical copies of each picture are associated with one another and assigned a unique cluster id. This realizes the clustering of repeated pictures, and the picture corresponding to a cluster id is the cluster picture of that cluster.
After the repeated pictures have been clustered and the cluster pictures obtained, S11 further extracts the marked text fields of each cluster picture from every web page on which it appears; that is, the marked text fields are crawled from the web pages to which the many repeated pictures belong. To improve the quality of the marked text fields, junk text is removed, and all marked text fields remaining after removal form the marked text field set of the cluster picture.
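The merge step described above can be illustrated with a small sketch. It assumes the neighbor retrieval has already produced pairs of picture ids judged identical; the names DisjointSet and cluster_duplicates, and the use of the set's root as the cluster id, are illustrative assumptions and are not taken from the patent.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find over picture ids, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_duplicates(duplicate_pairs):
    """Merge pairwise 'identical picture' results into clusters.

    duplicate_pairs: iterable of (picture_id, picture_id) pairs, assumed to
    come from a neighbor-retrieval step that is not shown here.
    Returns {cluster_id: sorted list of member picture ids}; the cluster id
    is simply the root of each set (an assumption for illustration).
    """
    ds = DisjointSet()
    for a, b in duplicate_pairs:
        ds.union(a, b)
    clusters = defaultdict(list)
    for pic in list(ds.parent):
        clusters[ds.find(pic)].append(pic)
    return {cid: sorted(members) for cid, members in clusters.items()}
```

Because union-find merges transitively, a picture found identical to b on one machine and b found identical to c on another end up in the same cluster even though a and c were never compared directly.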
Specifically, the junk text removal may be performed in at least one of the following manners:
1. Search the text content in the marked text field using a preset matching pattern, and remove marked text fields whose text content is junk text. The preset matching pattern is a "single character + space" pattern, which can uncover characteristic texts of types such as lotteries and parasites. For example, this pattern can mine junk texts such as "static parasite ranking for generations", "unique online casino appointed by an official", and "online dealing of a real beauty officer".
2. Remove marked text fields obtained from junk web pages, according to the web page type corresponding to the marked text field. Web pages on the internet can generally be divided into junk web pages and normal web pages; marked text fields that come from junk web pages can be filtered out.
3. Remove marked text fields whose publication time is earlier than a set time, according to the publication time of the web page corresponding to the marked text field. The set time can be chosen according to the amount of marked text field data required: the larger the required amount, the longer the set time, and vice versa. For example, if the set time is 1 year and a marked text field comes from web page A, which was published 3 years ago, the marked text field is removed.
4. Remove marked text fields whose size is smaller than a set size, according to the size of the marked text field. For example, remove marked text fields with a length and width of less than 180 mm.
5. Compare the text contents, and remove marked text fields whose text content is repeated.
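A minimal sketch of rules 1, 2, 3 and 5 above follows. The TextField record, the regular expression used for the "single character + space" pattern, and the one-year default window are assumptions made for illustration; rule 4 is omitted because the source does not make clear what size is being measured.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed rendering of the "single character + space" pattern: at least four
# single characters, each separated from the next by whitespace.
SPACED_CHARS = re.compile(r"(?:\b\w\s+){3,}\w\b")


@dataclass(frozen=True)
class TextField:
    text: str             # text of the marked text field
    page_is_spam: bool    # page type, assumed to be classified upstream
    published: datetime   # publication time of the source web page


def filter_fields(fields, now, max_age=timedelta(days=365)):
    """Return the marked text fields that survive rules 1, 2, 3 and 5."""
    kept, seen = [], set()
    for f in fields:
        if SPACED_CHARS.search(f.text):   # rule 1: spaced-out junk text
            continue
        if f.page_is_spam:                # rule 2: field came from a junk page
            continue
        if now - f.published > max_age:   # rule 3: published too long ago
            continue
        if f.text in seen:                # rule 5: repeated text content
            continue
        seen.add(f.text)
        kept.append(f)
    return kept
```

The rules are independent, so any subset can be enabled, which matches the "at least one of the following manners" wording.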
Using the marked text field set obtained after junk text removal, S12 is executed to calculate the word weight of the keywords contained in each marked text field in the set. The word weight represents the importance of a keyword to the cluster picture, and the importance is directly proportional to the word weight. Specifically, for the marked text field set of each cluster picture, the keywords in each marked text field are obtained, and the following target parameters are obtained for each keyword: the word frequency of the keyword in the marked text field to which it belongs, the number of times the keyword occurs in the marked text field to which it belongs, and the number of website domain names corresponding to the keyword. The word weight of each keyword is then calculated from its target parameters.
For each keyword, the importance degree of the keyword in each marked text field to which it belongs is calculated from the keyword's word frequency and number of occurrences in that field; the importance degree is accumulated with attenuation over the word frequency and number of occurrences, and can be calculated by formula two below. The word weight of each keyword is then calculated from the importance degree of the keyword in all marked text fields to which it belongs and the number of website domain names corresponding to the keyword, and can be calculated by formula one below:
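As a rough illustration of gathering these per-keyword parameters, the sketch below counts, for every keyword, how often it occurs in each marked text field and how many distinct website domains it comes from. Whitespace tokenization, the (text, page_url) input shape, and the function name collect_keyword_stats are assumptions; the patent does not specify the keyword extraction method, and the weight formulas themselves are not reproduced here.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def collect_keyword_stats(fields):
    """Collect raw statistics per keyword.

    fields: iterable of (text, page_url) pairs, one per marked text field.
    Returns {keyword: {"per_field_counts": [...], "n_fields": int,
                       "n_domains": int}}.
    """
    per_field_counts = defaultdict(list)
    domains = defaultdict(set)
    for text, page_url in fields:
        # Assumption: keywords are whitespace-separated tokens.
        counts = Counter(text.lower().split())
        host = urlparse(page_url).hostname
        for word, count in counts.items():
            per_field_counts[word].append(count)
            domains[word].add(host)
    return {
        word: {
            "per_field_counts": counts_list,
            "n_fields": len(counts_list),
            "n_domains": len(domains[word]),
        }
        for word, counts_list in per_field_counts.items()
    }
```

Counting distinct domains separately from raw occurrences means a keyword repeated many times on one site contributes less evidence than the same keyword appearing across many independent sites.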
Figure BDA0002476829290000091
Figure BDA0002476829290000092
Here w represents the word weight; i represents the i-th marked text field in the marked text field set; n represents the number of marked text fields in the set; f(i) represents the importance degree of the keyword in the i-th marked text field; n_page_domain_uniq represents the number of website domain names corresponding to the cluster picture; Max_weight represents a preset maximum weight value; norm_k represents a normalization parameter; m represents the word frequency; and n_domain represents the number of times the keyword appears in the marked text field to which it belongs. The maximum value of the word weight may be 255, indicating that the keyword appears frequently across the marked text fields, and the minimum value may be 1, indicating that the keyword rarely appears in the cluster. norm_k may take the value 4.
Based on the calculated word weights, S13 is performed to filter the keywords. Specifically, all keywords of each cluster picture can be sorted by word weight, and the top n keywords with the largest word weights are taken as the target keywords, where n is greater than or equal to 1. For example, suppose that repeated picture clustering finds that a picture of "lover's music" is repeated 76 times, and that 50 usable marked text fields are obtained from the web pages on which it appears. The word weights of the keywords in these marked text fields are calculated as [lover's music: 49; happy plot of the lovers: 26; stereo: 24; lovers: 10; materials: 10; ...]. The target keywords of the picture are then extracted according to the word weights as [lover's music: 49; happy plot of the lovers: 26; stereo: 24], which excludes keywords that are inconsistent with or irrelevant to the picture.
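The top-n screening can be sketched in a few lines. The function name select_target_keywords and the default n=3 are illustrative assumptions; the source only requires n to be at least 1.

```python
import heapq


def select_target_keywords(word_weights, n=3):
    """Return the n (keyword, weight) pairs with the largest word weights.

    word_weights: {keyword: word weight} for one cluster picture.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return heapq.nlargest(n, word_weights.items(), key=lambda kv: kv[1])


# Weights loosely following the example in the text; the keyword strings are
# as rendered by the translation.
example = {
    "lover's music": 49,
    "happy plot of the lovers": 26,
    "stereo": 24,
    "lovers": 10,
    "materials": 10,
}
targets = select_target_keywords(example, n=3)
```

A fixed n is the simplest reading of the text; a weight threshold would be an alternative, but the source does not describe one.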
After the target keywords of each cluster picture have been obtained, S14 is executed to rank the picture search results. Specifically, the target keywords of each cluster picture can be matched against the search words used in the picture search, and the matching score between the target keywords of the cluster picture and the search words is calculated from the matched keywords and their word weights. The picture search results are then ranked according to the matching score of each cluster picture. All target keywords of a cluster picture, together with their word weights, form the description text of the cluster picture, so matching the target keywords against the search words amounts to matching the description text against the search words. The matching score cluster_tf_bmrank between the target keywords and the search words can be obtained as the product of the text score doc_score of the description text of the cluster picture and the text matching score jaccard_match_weight between the description text and the search words, as shown in formula three:
cluster_tf_bmrank = doc_score × jaccard_match_weight    (formula three)
The text score doc_score is the word weight of the matching keywords, that is, of those target keywords of the cluster picture that match the search words. A matching keyword is a keyword that appears both in the target keywords and in the search words, so its word weight can be calculated from its word weight as a target keyword and its word weight as a search word. Specifically, doc_score can be calculated by formula four:
doc_score = Σ (match_term_freq × query_weight)    (formula four)
Here match_term_freq represents the word weight of a matching keyword as a target keyword, and query_weight represents the word weight of the matching keyword as a search word.
jaccard_match_weight is the text matching score, also called the matching weight. It represents the degree of matching between all target keywords of the cluster picture and the search words, and can be calculated from the union of the search words and the target keywords together with the matching keywords, as shown in formula five:
Figure BDA0002476829290000111
Here Query_term_freq ∪ Cluster_term_freq represents the sum of the word weights over the union of the search words and the target keywords, which equals the total number of keywords in the union multiplied by the average word weight of the target keywords; match_term_freq represents the word weight of a matching keyword as a target keyword.
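The scoring described by formulas three to five can be sketched as follows. Formulas three and four follow the text directly. The image for formula five is not available in this text, so the ratio used here (sum of matched target weights divided by the union weight described above) is an assumption, as are the function names and the dictionary inputs.

```python
def matching_score(target_weights, query_weights):
    """Sketch of cluster_tf_bmrank for one cluster picture.

    target_weights: {target keyword: word weight as target keyword}
    query_weights:  {search word: word weight as search word}
    """
    matched = set(target_weights) & set(query_weights)
    if not matched:
        return 0.0

    # Formula four: sum of target weight times query weight over matches.
    doc_score = sum(target_weights[t] * query_weights[t] for t in matched)

    # Union weight as described in the text: number of keywords in the union
    # multiplied by the average word weight of the target keywords.
    union_size = len(set(target_weights) | set(query_weights))
    avg_target = sum(target_weights.values()) / len(target_weights)
    union_weight = union_size * avg_target

    # ASSUMED form of formula five: matched target weight over union weight.
    matched_weight = sum(target_weights[t] for t in matched)
    jaccard_match_weight = matched_weight / union_weight

    # Formula three: product of the two scores.
    return doc_score * jaccard_match_weight


def rank_clusters(cluster_targets, query_weights):
    """Order cluster ids by descending matching score."""
    return sorted(
        cluster_targets,
        key=lambda cid: matching_score(cluster_targets[cid], query_weights),
        reverse=True,
    )
```

For a cluster with target keywords {"lovers": 40, "game": 20} and a query {"lovers": 1.0, "wallpaper": 1.0}, doc_score is 40, the union has 3 keywords with an average target weight of 30, and the sketch returns 40 × (40 / 90).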
Repeated picture clustering yields a large number of duplicate documents (docs). The method first filters out junk docs using information such as time, site, size, and text; then calculates word weights from the related text, sites, word distribution, and the like; and finally generates a feature (denoted cluster_tf_bmrank) from how the word weights match the picture search query. This feature participates in training the ranking model so as to optimize the ranking of picture search results.
The matching score cluster_tf_bmrank calculated for each cluster picture can be used as a one-dimensional feature that participates in training the ranking model for picture search results, thereby further optimizing the ranking of the picture search results. Alternatively, an existing ranking of the picture search results can be re-ranked according to the matching score cluster_tf_bmrank, so that the optimized ranking reflects the degree of match between the pictures and the search intent and ranking accuracy is improved.
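The re-ranking alternative can be sketched in Python as follows. The sketch assumes that doc_score and jaccard_match_weight have already been computed for each result; the ClusterResult class, its field names, and the sample values are hypothetical and serve only to illustrate formula three followed by a descending sort.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterResult:
    """Hypothetical container for one picture search result."""
    picture_id: str
    doc_score: float
    jaccard_match_weight: float

    @property
    def cluster_tf_bmrank(self) -> float:
        # Formula three: product of text score and text matching score.
        return self.doc_score * self.jaccard_match_weight


def rerank(results: List[ClusterResult]) -> List[ClusterResult]:
    """Return results ordered by descending cluster_tf_bmrank.

    sorted() is stable, so results with equal scores keep their
    existing relative order.
    """
    return sorted(results, key=lambda r: r.cluster_tf_bmrank, reverse=True)


results = [
    ClusterResult("a", doc_score=1.0, jaccard_match_weight=0.25),
    ClusterResult("b", doc_score=2.0, jaccard_match_weight=0.5),
    ClusterResult("c", doc_score=0.5, jaccard_match_weight=0.5),
]
ranked_ids = [r.picture_id for r in rerank(results)]
```

Using a stable sort means that ties fall back to the existing ranking, which is one reasonable choice when the score is applied on top of an earlier ordering.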
Based on the same inventive concept, and corresponding to the picture information processing method provided by the foregoing embodiment, an embodiment of the present application further provides a picture information processing device. Referring to fig. 2, the device includes:
a clustering unit 21, configured to perform repeated picture clustering on pictures in web pages to obtain the cluster picture of each cluster and the tagged text field set of the cluster picture;
a word weight calculation unit 22, configured to obtain, for each cluster picture and according to the tagged text field set, the keywords contained in each tagged text field in the set and the word weights of the keywords, where a word weight reflects the degree of correlation between the keyword and the cluster picture;
a keyword extraction unit 23, configured to obtain the target keywords of each cluster picture according to the word weights of all keywords corresponding to that cluster picture; and
a sorting unit 24, configured to sort the picture search results according to the target keywords of each cluster picture and the word weights of the target keywords.
As an optional implementation, the word weight calculation unit 22 is configured to: obtain the keywords in each tagged text field; obtain the following target parameters for each keyword: the word frequency and the number of occurrences of the keyword in the tagged text fields, and the number of website domain names corresponding to the keyword; and calculate the word weight of each keyword according to its target parameters. Optionally, the word weight calculation unit 22 is further configured to: for each keyword, calculate the importance of the keyword in each tagged text field to which it belongs according to its word frequency and number of occurrences in that field, where the importance is accumulated with attenuation according to the word frequency and the number of occurrences; and calculate the word weight of each keyword according to its importance in all the tagged text fields to which it belongs and the number of website domain names corresponding to the keyword.
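The shape of this calculation can be illustrated with a short Python sketch. The concrete formulas are not given in this passage, so everything numeric here is an assumption: a geometric decay factor of 0.5 stands in for the attenuated accumulation, and a base-2 logarithm of the domain count stands in for the domain-name factor. The function names are likewise hypothetical.

```python
import math
from typing import List, Tuple

DECAY = 0.5  # hypothetical attenuation factor, not taken from the text


def field_importance(term_freq: float, occurrences: int,
                     decay: float = DECAY) -> float:
    """Importance of a keyword in one tagged text field.

    Each further occurrence adds a geometrically smaller contribution,
    which is one simple way to accumulate with attenuation.
    """
    return sum(term_freq * decay ** i for i in range(occurrences))


def word_weight(field_stats: List[Tuple[float, int]],
                num_domains: int) -> float:
    """Combine per-field importance with the number of website domains.

    field_stats: (term_freq, occurrences) for every tagged text field
                 that contains the keyword.
    The log2(1 + num_domains) factor is an illustrative assumption.
    """
    importance = sum(field_importance(tf, n) for tf, n in field_stats)
    return importance * math.log2(1 + num_domains)


# Hypothetical keyword seen in two fields and on three domains.
weight = word_weight([(1.0, 3), (0.5, 1)], num_domains=3)
```

With these assumed factors, repeated occurrences within one field raise the weight with diminishing returns, while appearing on more distinct domains raises it more slowly than linearly.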
As an optional implementation, the sorting unit 24 is configured to: match the search terms used in the picture search against the target keywords of each cluster picture to obtain matching keywords; calculate a matching score between the target keywords of each cluster picture and the search terms according to the word weight of each matching keyword as a search term and its word weight as a target keyword; and sort the picture search results according to the matching scores. Optionally, the sorting unit 24 is further configured to: for each cluster picture, calculate the word weight of the matching keywords according to their word weights as target keywords and their word weights as search terms; calculate a matching weight according to the union of the search terms and the target keywords together with the matching keywords; and calculate the matching score between the target keywords of each cluster picture and the search terms according to the word weight of the matching keywords and the matching weight.
As an optional implementation, the clustering unit 21 is configured to: perform repeated picture clustering on pictures in web pages to obtain the cluster picture of each cluster; extract a tagged text field of the cluster picture from each web page on which the cluster picture appears; and remove junk text from the tagged text fields, taking all tagged text fields that remain after junk removal as the tagged text field set. Optionally, the clustering unit 21 is further configured to: search the text content of the tagged text fields using a preset matching pattern and remove tagged text fields whose text content is junk text; and/or remove tagged text fields obtained from junk web pages according to the web page type corresponding to each tagged text field; and/or remove tagged text fields whose publication time is earlier than a set time according to the publication time of the web page corresponding to each tagged text field.
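The three junk-removal criteria can be sketched in Python as follows. The TaggedTextField class, the regular expressions, the junk page types, and the cutoff date are all hypothetical placeholders chosen for this example; the specification does not fix any of them.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class TaggedTextField:
    """Hypothetical record for one tagged text field of a cluster picture."""
    text: str
    page_type: str
    published: date


# Illustrative settings only; none of these values come from the text.
JUNK_PATTERNS = [re.compile(p, re.IGNORECASE)
                 for p in (r"click here", r"^untitled$", r"^img_?\d+$")]
JUNK_PAGE_TYPES = {"spam", "link_farm"}
CUTOFF = date(2015, 1, 1)


def is_junk(field: TaggedTextField) -> bool:
    """Apply the three criteria: text pattern, page type, publication time."""
    text = field.text.strip()
    if any(p.search(text) for p in JUNK_PATTERNS):
        return True
    if field.page_type in JUNK_PAGE_TYPES:
        return True
    return field.published < CUTOFF


def build_field_set(fields: List[TaggedTextField]) -> List[TaggedTextField]:
    """Keep only the tagged text fields that pass every junk check."""
    return [f for f in fields if not is_junk(f)]


fields = [
    TaggedTextField("Giant panda eating bamboo", "news", date(2019, 5, 1)),
    TaggedTextField("IMG_0042", "news", date(2019, 5, 1)),
    TaggedTextField("Panda cub", "spam", date(2019, 5, 1)),
    TaggedTextField("Panda at the zoo", "blog", date(2010, 1, 1)),
]
kept = build_field_set(fields)
```

The sketch applies all three checks together; since the text joins them with "and/or", an implementation could equally enable only a subset.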
With regard to the device in the above embodiment, the specific manner in which each unit performs its operations has been described in detail in the method embodiment and is not elaborated here.
Fig. 3 is a block diagram illustrating an electronic device 800 for implementing a method for processing picture information according to an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or some of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks.
The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the electronic device 800 is in an operating mode, such as a shooting mode or a video mode. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor component 814 may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a method for processing picture information, the method comprising: performing repeated picture clustering on pictures in web pages to obtain the cluster picture of each cluster and the tagged text field set of the cluster picture; for each cluster picture, obtaining, according to the tagged text field set, the keywords contained in each tagged text field in the set and the word weights of the keywords, where a word weight reflects the degree of correlation between the keyword and the cluster picture; obtaining the target keywords of each cluster picture according to the word weights of all keywords corresponding to that cluster picture; and sorting the picture search results according to the target keywords of each cluster picture and the word weights of the target keywords.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for processing picture information, the method comprising:
performing repeated picture clustering on pictures in web pages to obtain a cluster picture of each cluster and a tagged text field set of the cluster picture;
for each cluster picture, obtaining, according to the tagged text field set, keywords contained in each tagged text field in the tagged text field set and word weights of the keywords, wherein the word weights reflect the degree of correlation between the keywords and the cluster picture;
obtaining target keywords of each cluster picture according to the word weights of all keywords corresponding to each cluster picture;
and sorting picture search results according to the target keywords of each cluster picture and the word weights of the target keywords.
2. The method of claim 1, wherein the obtaining, according to the tagged text field set, keywords contained in each tagged text field in the tagged text field set and word weights of the keywords comprises:
obtaining the keywords in each tagged text field;
obtaining the following target parameters for each keyword: the word frequency and the number of occurrences of the keyword in the tagged text fields, and the number of website domain names corresponding to the keyword;
and calculating the word weight of each keyword according to the target parameters of the keyword.
3. The method of claim 2, wherein the calculating the word weight of each keyword according to the target parameters of the keyword comprises:
for each keyword, calculating the importance of the keyword in each tagged text field to which it belongs according to the word frequency and the number of occurrences of the keyword in that tagged text field, wherein the importance is accumulated with attenuation according to the word frequency and the number of occurrences;
and calculating the word weight of each keyword according to the importance of the keyword in all the tagged text fields to which it belongs and the number of website domain names corresponding to the keyword.
4. The method of claim 1, wherein the sorting picture search results according to the target keywords of each cluster picture and the word weights of the target keywords comprises:
matching search terms used in the picture search against the target keywords of each cluster picture to obtain matching keywords;
calculating a matching score between the target keywords of each cluster picture and the search terms according to the word weight of each matching keyword as a search term and the word weight of the matching keyword as a target keyword;
and sorting the picture search results according to each matching score.
5. The method of claim 4, wherein the calculating a matching score between the target keywords of each cluster picture and the search terms according to the word weight of each matching keyword as a search term and the word weight of the matching keyword as a target keyword comprises:
for each cluster picture, calculating the word weight of the matching keywords according to the word weights of the matching keywords as target keywords and the word weights of the matching keywords as search terms; calculating a matching weight according to the union of the search terms and the target keywords together with the matching keywords;
and calculating the matching score between the target keywords of each cluster picture and the search terms according to the word weight of the matching keywords and the matching weight.
6. The method according to any one of claims 1 to 5, wherein the performing repeated picture clustering on pictures in web pages to obtain a cluster picture of each cluster and a tagged text field set of the cluster picture comprises:
performing repeated picture clustering on pictures in web pages to obtain the cluster picture of each cluster;
extracting a tagged text field of the cluster picture from each web page on which the cluster picture appears;
and removing junk text from the tagged text fields, and taking all tagged text fields remaining after the junk text is removed as the tagged text field set.
7. The method of claim 6, wherein the removing junk text from the tagged text fields comprises:
searching the text content of the tagged text fields using a preset matching pattern, and removing tagged text fields whose text content is junk text; and/or
removing tagged text fields obtained from junk web pages according to the web page type corresponding to each tagged text field; and/or
removing tagged text fields whose publication time is earlier than a set time according to the publication time of the web page corresponding to each tagged text field.
8. An apparatus for processing picture information, the apparatus comprising:
a clustering unit, configured to perform repeated picture clustering on pictures in web pages to obtain a cluster picture of each cluster and a tagged text field set of the cluster picture;
a word weight calculation unit, configured to obtain, for each cluster picture and according to the tagged text field set, keywords contained in each tagged text field in the tagged text field set and word weights of the keywords, wherein the word weights reflect the degree of correlation between the keywords and the cluster picture;
a keyword extraction unit, configured to obtain target keywords of each cluster picture according to the word weights of all keywords corresponding to each cluster picture;
and a sorting unit, configured to sort picture search results according to the target keywords of each cluster picture and the word weights of the target keywords.
9. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to execute operating instructions included in the one or more programs for performing the corresponding method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps corresponding to the method according to any one of claims 1 to 7.
CN202010366994.4A 2020-04-30 2020-04-30 Picture information processing method and device and electronic equipment Active CN113590861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010366994.4A CN113590861B (en) 2020-04-30 2020-04-30 Picture information processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010366994.4A CN113590861B (en) 2020-04-30 2020-04-30 Picture information processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113590861A true CN113590861A (en) 2021-11-02
CN113590861B CN113590861B (en) 2024-06-18

Family

ID=78237596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010366994.4A Active CN113590861B (en) 2020-04-30 2020-04-30 Picture information processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113590861B (en)

Also Published As

Publication number Publication date
CN113590861B (en) 2024-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40055320; Country of ref document: HK)
GR01 Patent grant