CN112884053A - Website classification method, system, equipment and medium based on image-text mixed characteristics - Google Patents

Website classification method, system, equipment and medium based on image-text mixed characteristics

Info

Publication number
CN112884053A
Authority
CN
China
Prior art keywords
model
classification
vector
image
paragraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110222323.5A
Other languages
Chinese (zh)
Other versions
CN112884053B (en)
Inventor
张乐平
顾明娟
吴一超
卞豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Jiangsuan Tiancheng Information Technology Co.,Ltd.
Original Assignee
Jiangsu Jiangsuan Tiancheng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Jiangsuan Tiancheng Information Technology Co ltd
Priority to CN202110222323.5A
Publication of CN112884053A
Application granted
Publication of CN112884053B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a website classification method, system, equipment and medium based on image-text mixed features. The classification method comprises the following steps: converting each block of text into a paragraph vector through a distributed memory paragraph vector (PV-DM) model; feeding the image matrix into a ResNet model and converting the output of the second-to-last layer into a feature vector used as input; and associating the image and text feature matrices and feeding them into an LSTM model for training to generate the final webpage classification network. The method represents webpage content with an LSTM model based on image-text mixed features, where the mixed features are the associated sequence of the outputs of the PV-DM model and the ResNet model. The model describes the content of the webpage, and the sequence model also reflects the order in which a human reads an article, which greatly improves recognition accuracy.

Description

Website classification method, system, equipment and medium based on image-text mixed characteristics
Technical Field
The invention relates to the field of computer image processing, in particular to a method, a system, equipment and a medium for classifying websites based on image-text mixed characteristics.
Background
With the popularization of the internet, the threshold for setting up a website has dropped, and illegal websites of all kinds have proliferated: websites without ICP filing, pornographic websites, gambling websites, websites offering pirated films, websites offering pirated fiction, and others. These websites have a very bad influence on society and have become a hotbed for online crime. They also hit the copyright market hard and undermine copyright protection. The cultural supervision department therefore needs to classify such privately established websites accurately in order to improve law enforcement efficiency.
Existing machine learning methods for classifying websites mainly include the following:
1) classification based on webpage text:
A. measuring the similarity between texts with deep learning algorithms such as CNN;
B. classifying the text with machine learning methods such as logistic regression and Bayes;
C. using structural attributes of the webpage, such as HTML tags, CSS and various attributes, as input and predicting with an SVM;
2) classification based on website log data.
However, none of these methods solves the problem of low classification accuracy: the accuracy does not exceed 80%.
Disclosure of Invention
To address the high error rate of current website classification, the invention provides a website classification method, system, equipment and medium based on image-text mixed features.
The technical scheme for achieving the purpose of the invention is as follows: a website classification method based on image-text mixed features, comprising the following steps:
sequentially extracting the texts and pictures in the webpage;
converting a block of text into a paragraph vector through a distributed memory paragraph vector (PV-DM) model;
feeding the image matrix into a ResNet model, taking the tensor of shape (1, c, x, y) output by the second-to-last layer, and converting that tensor into an image classification vector;
and converting the paragraph vectors and the image classification vectors into one-dimensional sequences respectively, associating them, and feeding them into an LSTM model for training to generate the final webpage classification network.
Further, converting a block of text into a paragraph vector through the PV-DM model comprises:
training the PV-DM model on an existing paragraph corpus, then feeding the text extracted from the webpage into the trained model; the model output is the paragraph vector of the target text.
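The inference half of this step can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the word vectors and output weights below are random stand-ins for a trained PV-DM model, the names `infer_paragraph_vector`, `pair_loss_and_grad` and `total_loss` are hypothetical, and a full softmax with batch gradient descent is assumed for simplicity. Only the paragraph vector is updated while the stand-in weights stay frozen.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pair_loss_and_grad(doc_vec, word_vecs, out_weights, context, target):
    """Cross-entropy loss of one (context -> target) pair and its gradient w.r.t. doc_vec."""
    dim = len(doc_vec)
    parts = [doc_vec] + [word_vecs[w] for w in context]
    # PV-DM hidden layer: average of the paragraph vector and the context word vectors.
    hidden = [sum(p[i] for p in parts) / len(parts) for i in range(dim)]
    probs = softmax([sum(row[i] * hidden[i] for i in range(dim)) for row in out_weights])
    loss = -math.log(probs[target])
    # d(loss)/d(hidden) = U^T (probs - onehot); doc_vec contributes 1/len(parts) of hidden.
    errs = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
    grad = [sum(errs[j] * out_weights[j][i] for j in range(len(errs))) / len(parts)
            for i in range(dim)]
    return loss, grad


def total_loss(doc_vec, word_vecs, out_weights, pairs):
    """Summed loss over all (context, target) pairs of one text block."""
    return sum(pair_loss_and_grad(doc_vec, word_vecs, out_weights, c, t)[0] for c, t in pairs)


def infer_paragraph_vector(word_ids, word_vecs, out_weights, window=2, steps=50, lr=0.5):
    """Infer a paragraph vector for unseen text while the trained weights stay frozen."""
    dim = len(word_vecs[0])
    pairs = [(word_ids[i - window:i], word_ids[i]) for i in range(window, len(word_ids))]
    doc_vec = [0.0] * dim
    for _ in range(steps):
        grads = [pair_loss_and_grad(doc_vec, word_vecs, out_weights, c, t)[1] for c, t in pairs]
        for i in range(dim):
            doc_vec[i] -= lr * sum(g[i] for g in grads)
    return doc_vec, pairs


# Stand-in "trained" weights (assumption): in practice they come from the paragraph corpus.
rng = random.Random(0)
VOCAB, DIM = 6, 4
word_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]
out_weights = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]

paragraph = [0, 3, 1, 4, 2, 5]  # hypothetical token ids of a text block from a webpage
vec, pairs = infer_paragraph_vector(paragraph, word_vecs, out_weights)
```

The returned `vec` plays the role of the paragraph vector that is later converted into a one-dimensional sequence; a production system would more likely rely on an existing PV-DM library than on hand-written gradients.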
Further, taking the tensor of shape (1, c, x, y) output by the second-to-last layer after the image matrix is fed into the model comprises:
training a multi-class ResNet model on an existing labeled picture training set; feeding the pictures extracted from the webpage training set into the model, extracting the tensor of shape (1, c, x, y) output by the second-to-last layer, and converting it into an image classification vector.
Further, converting the paragraph vectors and the image classification vectors into one-dimensional sequences, associating them, and feeding them into an LSTM model for training to generate the final webpage classification network comprises:
taking a group of texts and pictures from the webpage-classified text-picture training set as input; feeding the text into the PV-DM model to output the predicted paragraph vector, and converting it into a one-dimensional sequence as input, wherein if the group contains no text, the paragraph vector is an all-zero one-dimensional sequence;
feeding the pictures in the group into the trained picture classification ResNet model, and converting the tensor of shape (1, c, x, y) output by the second-to-last layer into a one-dimensional sequence as input, wherein 1 is the size of the picture index dimension, c is the length of the classification vector, and x × y is the number of 224 × 224 regions contained in the image;
when x = 1 and y = 1, that is, when the input image is small, the length-c classification vector is used directly as the result;
when x > 1 or y > 1, that is, when the input image is large, the (1, c, x, y) tensor is first converted into a (c, x, y) matrix, each row is taken as a feature vector, and the feature vectors are summed and averaged to merge them into a single vector, which is output as the image classification vector;
if the group contains no picture, the picture classification vector is an all-zero one-dimensional sequence;
and splicing the one-dimensional sequences in order and feeding them into an LSTM model for training to generate the final webpage classification network.
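The handling of the (1, c, x, y) tensor can be illustrated with a short sketch. It assumes one reading of the text: each of the x × y regions contributes a length-c vector, and those vectors are averaged element-wise. The function name `pool_image_tensor` and the nested-list layout are hypothetical choices made for the example.

```python
def pool_image_tensor(tensor):
    """Reduce a nested-list tensor of shape (1, c, x, y) to a length-c vector."""
    if len(tensor) != 1:
        raise ValueError("expected a leading dimension of size 1")
    channels = tensor[0]  # shape (c, x, y)
    pooled = []
    for grid in channels:  # grid has shape (x, y): one value per image region
        values = [v for row in grid for v in row]
        # With x = y = 1 this is the single value itself; otherwise the region mean.
        pooled.append(sum(values) / len(values))
    return pooled


# Small image: x = y = 1, so the c values pass through unchanged.
small = [[[[0.2]], [[0.8]]]]
# Large image: x = y = 2, so each class score is averaged over four regions.
large = [[[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]]

small_vec = pool_image_tensor(small)
large_vec = pool_image_tensor(large)
```

Both cases return a vector of the same length c, which is what lets images of different sizes share one slot in the sequence fed to the LSTM.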
The invention also provides a website classification system based on image-text mixed features, comprising the following modules:
the paragraph vector conversion module is used for converting any one text into a paragraph vector through a memory type paragraph vector model;
an image classification vector generation module which uses a ResNet model to input the image matrix into the model, then takes a tensor with the shape of (1, c, x, y) of the output of the second last layer, and converts the tensor into an image classification vector;
and the webpage classification module is used for converting the paragraph vectors and the image classification vectors into one-dimensional sequences respectively, associating the one-dimensional sequences, inputting the one-dimensional sequences into an LSTM model for training, and generating a final webpage classification network.
An electronic device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the website classification method based on the image-text mixing characteristics.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the above website classification method based on image-text mixed features.
Compared with the prior art, the invention has the following beneficial effects: webpage content is represented by an LSTM model based on image-text mixed features, where the mixed features are the associated sequence of the outputs of the PV-DM model and the ResNet model. The model describes the content of the webpage, and the sequence model also reflects the order in which a human reads an article. Recognition accuracy is greatly improved, reaching 91.3% on the existing 50M-scale webpage test set.
Drawings
FIG. 1 is a flowchart of the website classification method based on image-text mixed features.
FIG. 2 is a schematic diagram of extracting image and text features from a page with the PV-DM and ResNet models.
FIG. 3 is a block diagram of training the webpage classification LSTM model and predicting with it.
Detailed Description
The invention provides a website classification method based on image-text mixed features which, as shown in FIG. 1, comprises the following steps:
In the first step, each block of text is converted into a paragraph vector through the Distributed Memory Model of Paragraph Vectors (PV-DM). PV-DM better describes the semantics of a long text paragraph.
In the second step, the image matrix is fed into a 50-layer ResNet model, and the features of the second-to-last layer are taken as input.
In the third step, the image and text feature matrices are associated and fed into the LSTM model for training to generate the final webpage classification network.
As shown in FIG. 2, every webpage has a central body portion, which can be expressed as a sequence. If the body is pure text, it can be expressed as a text feature alone; if it is pure pictures, it can be expressed as a picture feature; and if it combines text and pictures, it can be expressed as a combination of the two feature vectors. The specific conversion is as follows:
1) sequentially extracting the texts and pictures in the webpage;
2) taking a group of texts and pictures from the webpage-classified text-picture training set as input; feeding the text into the PV-DM model to output the predicted paragraph vector, and converting it into a one-dimensional sequence as input, wherein if the group contains no text, the paragraph vector is an all-zero one-dimensional sequence;
3) feeding the pictures in the group into the trained picture classification ResNet model, and converting the tensor of shape (1, c, x, y) output by the second-to-last layer into a one-dimensional sequence as input; if the group contains no picture, the picture classification vector is an all-zero one-dimensional sequence;
wherein 1 is the size of the picture index dimension, c is the length of the classification vector, and x × y is the number of 224 × 224 regions contained in the image;
when x = 1 and y = 1, the length-c classification vector is used directly as the result;
when x > 1 or y > 1, the (1, c, x, y) tensor is first converted into a (c, x, y) matrix, each row is taken as a feature vector, the feature vectors are summed and averaged, and the image classification vector is output;
4) splicing the one-dimensional sequences in order and feeding them into an LSTM model for training to generate the final webpage classification network.
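The splicing in steps 2) to 4) can be sketched as follows. The sketch assumes that each group becomes one fixed-width row made of its paragraph vector followed by its image classification vector, with an all-zero vector standing in for a missing modality. The function name `build_page_sequence`, the dictionary keys and the toy dimensions are hypothetical.

```python
def build_page_sequence(groups, text_dim, image_dim):
    """Turn ordered webpage groups into one fixed-width feature row per group."""
    sequence = []
    for group in groups:
        text_vec = group.get("text")
        image_vec = group.get("image")
        if text_vec is None:
            text_vec = [0.0] * text_dim  # no text in this group
        if image_vec is None:
            image_vec = [0.0] * image_dim  # no picture in this group
        if len(text_vec) != text_dim or len(image_vec) != image_dim:
            raise ValueError("feature vector has an unexpected length")
        sequence.append(list(text_vec) + list(image_vec))
    return sequence


# Three groups in reading order: text plus picture, text only, picture only.
page = [
    {"text": [0.1, 0.2], "image": [0.9, 0.0, 0.1]},
    {"text": [0.3, 0.4]},
    {"image": [0.0, 1.0, 0.0]},
]
rows = build_page_sequence(page, text_dim=2, image_dim=3)
```

Keeping the rows in page order is what preserves the reading sequence that the LSTM is meant to model.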
As shown in FIG. 3, the process of LSTM model training and prediction is as follows:
1) sequentially obtaining the image and text feature sequence of a webpage;
2) feeding the image-text feature sequence into an LSTM (long short-term memory) network for training to obtain a webpage classification network model that captures the order in which a human reads an article;
3) obtaining the image-text feature sequence of an input webpage and obtaining the webpage classification vector by model prediction.
The sequence model expresses the order in which a human reads an article and reflects the way a human judges the type of a webpage from context.
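The prediction step can be illustrated with a forward pass through a single-layer LSTM followed by a softmax over webpage classes. This is a sketch under stated assumptions: the weights are random and untrained, the layer sizes are arbitrary, training (backpropagation through time) is omitted, and all names (`lstm_step`, `classify_page`, `random_params`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Compute weights @ vec + bias for nested-list weights."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_step(params, x, h, c):
    """One LSTM step; all gates read the concatenation of input x and state h."""
    xh = x + h
    i = [sigmoid(z) for z in affine(params["Wi"], params["bi"], xh)]    # input gate
    f = [sigmoid(z) for z in affine(params["Wf"], params["bf"], xh)]    # forget gate
    o = [sigmoid(z) for z in affine(params["Wo"], params["bo"], xh)]    # output gate
    g = [math.tanh(z) for z in affine(params["Wg"], params["bg"], xh)]  # candidate cell
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def classify_page(params, sequence):
    """Run the feature rows through the LSTM in order and return class probabilities."""
    hidden = len(params["bi"])
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    logits = affine(params["Wy"], params["by"], h)
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_params(input_dim, hidden, classes, seed=0):
    """Untrained stand-in weights (assumption); a real model would learn these."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in "ifog":
        params["W" + gate] = mat(hidden, input_dim + hidden)
        params["b" + gate] = [0.0] * hidden
    params["Wy"] = mat(classes, hidden)
    params["by"] = [0.0] * classes
    return params


params = random_params(input_dim=5, hidden=8, classes=3)
sequence = [[0.1, 0.2, 0.9, 0.0, 0.1], [0.3, 0.4, 0.0, 0.0, 0.0]]
probs = classify_page(params, sequence)
```

Because the hidden state is carried from one row to the next, the same rows presented in a different order generally yield a different classification vector, which is the property the text attributes to the sequence model.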
The invention also provides a website classification system based on image-text mixed features, comprising the following modules:
the paragraph vector conversion module is used for converting any one text into a paragraph vector through a memory type paragraph vector model;
an image classification vector generation module which uses a ResNet model to input the image matrix into the model, then takes a tensor with the shape of (1, c, x, y) of the output of the second last layer, and converts the tensor into an image classification vector;
and the webpage classification module is used for converting the paragraph vectors and the image classification vectors into one-dimensional sequences respectively, associating the one-dimensional sequences, inputting the one-dimensional sequences into an LSTM model for training, and generating a final webpage classification network.
It should be noted that the implementation of each module in the system is described in detail in the website classification method section and is not repeated here.
The invention also provides electronic equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein when the processor executes the program, the website classification method based on the image-text mixed characteristics is realized.
Further, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above website classification method based on image-text mixed features.
To test the error rate of the new model, the following three models were trained, each on its own training set:
a CNN model based on text features, trained on the text portion of the training set;
an SVM model based on webpage structure features, trained on the webpage structure feature set extracted from the training set;
an LSTM model based on image-text mixed features, trained on the image-text mixed labeled portion of the training set.
The test results obtained on the same 50M webpage test data set are shown in the following table:
TABLE 1
Model | Error rate
LSTM model based on image-text mixed features | 9.7%
CNN model based on text features | 26.7%
SVM model based on webpage structure features | 41.9%
As can be seen from the table, representing the webpage type through the original image-text mixed features greatly improves the accuracy of webpage classification.
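To put the error rates of Table 1 side by side, the short sketch below computes how much of each baseline's error the image-text LSTM model removes. The figures are copied from the table; the helper name `relative_error_reduction` is a hypothetical choice for this illustration.

```python
# Error rates in percent, as listed in Table 1.
error_rates = {
    "LSTM (image-text mixed features)": 9.7,
    "CNN (text features)": 26.7,
    "SVM (webpage structure features)": 41.9,
}


def relative_error_reduction(baseline, new):
    """Share of the baseline error removed by the new model, in percent."""
    return round(100.0 * (baseline - new) / baseline, 1)


lstm = error_rates["LSTM (image-text mixed features)"]
vs_cnn = relative_error_reduction(error_rates["CNN (text features)"], lstm)
vs_svm = relative_error_reduction(error_rates["SVM (webpage structure features)"], lstm)
```

Under these figures the LSTM model removes roughly 64% of the CNN model's errors and roughly 77% of the SVM model's errors.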
In the embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and device may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A website classification method based on image-text mixed characteristics is characterized by comprising the following steps:
sequentially extracting texts and pictures in the webpage;
converting a block of text into a paragraph vector through a distributed memory paragraph vector model;
feeding the image matrix into a ResNet model, taking the tensor of shape (1, c, x, y) output by the second-to-last layer, and converting the tensor into an image classification vector;
and converting the paragraph vectors and the image classification vectors into one-dimensional sequences respectively, associating them, and feeding them into an LSTM model for training to generate a final webpage classification network.
2. The website classification method based on image-text mixed features according to claim 1, wherein converting a block of text into a paragraph vector through the distributed memory paragraph vector model comprises:
training the distributed memory paragraph vector model on a paragraph corpus, feeding the text extracted from the webpage into the model, and obtaining, as the model output, the paragraph vector of the target text.
3. The website classification method based on image-text mixed features according to claim 1, wherein a multi-class ResNet model is trained on an existing labeled picture training set; and pictures extracted from the webpage training set are fed into the model, and the tensor of shape (1, c, x, y) output by the second-to-last layer of the model is extracted.
4. The website classification method based on image-text mixed features according to claim 1, wherein converting the paragraph vectors and the image classification vectors into one-dimensional sequences respectively, associating them, and feeding them into an LSTM model for training to generate a final webpage classification network comprises:
taking a group of texts and pictures from the webpage-classified text-picture training set as input, feeding the text into a PV-DM model to output the predicted paragraph vector, and converting it into a one-dimensional sequence as input, wherein if the group contains no text, the paragraph vector is an all-zero one-dimensional sequence;
feeding the pictures in the group into the trained picture classification ResNet model, and converting the tensor of shape (1, c, x, y) output by the second-to-last layer of the model into a one-dimensional sequence as input; if the group contains no picture, the picture classification vector is an all-zero one-dimensional sequence;
wherein 1 is the size of the picture index dimension, c is the length of the classification vector, and x × y is the number of 224 × 224 regions contained in the image;
when x = 1 and y = 1, the length-c classification vector is taken directly as the result;
when x > 1 or y > 1, the (1, c, x, y) tensor is first converted into a (c, x, y) matrix, each row is taken as a feature vector, the feature vectors are summed and averaged, and the image classification vector is output;
and splicing the one-dimensional sequences in order and feeding them into an LSTM model for training to generate a final webpage classification network.
5. A website classification system based on image-text mixed characteristics is characterized by comprising:
the paragraph vector conversion module is used for converting a text into a paragraph vector through a memory type paragraph vector model;
an image classification vector generation module which uses a ResNet model to input the image matrix into the model, then takes a tensor with the shape of (1, c, x, y) of the output of the second last layer, and converts the tensor into an image classification vector;
and the webpage classification module is used for converting the paragraph vectors and the image classification vectors into one-dimensional sequences respectively, associating the one-dimensional sequences, inputting the one-dimensional sequences into an LSTM model for training, and generating a final webpage classification network.
6. The website classification system based on image-text mixed features according to claim 5, wherein converting a block of text into a paragraph vector through the distributed memory paragraph vector model comprises:
training the distributed memory paragraph vector model on an existing paragraph corpus, feeding the text extracted from the webpage into the model, and obtaining, as the model output, the paragraph vector of the target text.
7. The website classification system based on image-text mixed features according to claim 5, wherein taking the tensor of shape (1, c, x, y) output by the second-to-last layer after the image matrix is fed into the model comprises:
training a multi-class ResNet model on an existing labeled picture training set; feeding the pictures extracted from the webpage training set into the model, extracting the tensor of shape (1, c, x, y) output by the second-to-last layer of the model, and converting the tensor into an image classification vector.
8. The website classification system based on image-text mixed features according to claim 5, wherein converting the paragraph vectors and the image classification vectors into one-dimensional sequences respectively, associating them, and feeding them into an LSTM model for training to generate a final webpage classification network comprises:
taking a group of texts and pictures from the webpage-classified text-picture training set as input, feeding the text into a PV-DM model to output the predicted paragraph vector, and converting it into a one-dimensional sequence as input, wherein if the group contains no text, the paragraph vector is an all-zero one-dimensional sequence;
feeding the pictures in the group into the trained picture classification ResNet model, and converting the tensor of shape (1, c, x, y) output by the second-to-last layer of the model into a one-dimensional sequence as input; if the group contains no picture, the picture classification vector is an all-zero one-dimensional sequence;
wherein 1 is the size of the picture index dimension, c is the length of the classification vector, and x × y is the number of 224 × 224 regions contained in the image;
when x = 1 and y = 1, the length-c classification vector is taken directly as the result;
when x > 1 or y > 1, the (1, c, x, y) tensor is first converted into a (c, x, y) matrix, each row is taken as a feature vector, the feature vectors are summed and averaged, and the image classification vector is output;
and splicing the one-dimensional sequences in order and feeding them into an LSTM model for training to generate a final webpage classification network.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the website classification method based on image-text mixed features according to any one of claims 1-4.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the website classification method based on image-text mixed features according to any one of claims 1-4.
CN202110222323.5A 2021-02-28 2021-02-28 Website classification method, system, equipment and medium based on image-text mixed characteristics Active CN112884053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110222323.5A CN112884053B (en) 2021-02-28 2021-02-28 Website classification method, system, equipment and medium based on image-text mixed characteristics

Publications (2)

Publication Number Publication Date
CN112884053A (en) 2021-06-01
CN112884053B (en) 2022-04-15

Family

ID=76054868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110222323.5A Active CN112884053B (en) 2021-02-28 2021-02-28 Website classification method, system, equipment and medium based on image-text mixed characteristics

Country Status (1)

Country Link
CN (1) CN112884053B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982505A (en) * 2023-03-16 2023-04-18 北京匠数科技有限公司 Website detection method and device based on VLM

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328384A1 (en) * 2015-05-04 2016-11-10 Sri International Exploiting multi-modal affect and semantics to assess the persuasiveness of a video
US20170270123A1 (en) * 2016-03-18 2017-09-21 Adobe Systems Incorporated Generating recommendations for media assets to be displayed with related text content
CN109934260A (en) * 2019-01-31 2019-06-25 中国科学院信息工程研究所 Image, text and data fusion sensibility classification method and device based on random forest
CN110196945A (en) * 2019-05-27 2019-09-03 北京理工大学 A kind of microblog users age prediction technique merged based on LSTM with LeNet
CN110399458A (en) * 2019-07-04 2019-11-01 淮阴工学院 A kind of Text similarity computing method based on latent semantic analysis and accidental projection
CN112287272A (en) * 2020-10-27 2021-01-29 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages

Also Published As

Publication number Publication date
CN112884053B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN109885692B (en) Knowledge data storage method, apparatus, computer device and storage medium
Nguyen et al. A neural local coherence model
CN107463605B (en) Method and device for identifying low-quality news resource, computer equipment and readable medium
US11288324B2 (en) Chart question answering
US9336299B2 (en) Acquisition of semantic class lexicons for query tagging
CN105022754B (en) Object classification method and device based on social network
CN111241232B (en) Business service processing method and device, service platform and storage medium
CN112417153B (en) Text classification method, apparatus, terminal device and readable storage medium
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
CN110175221B (en) Junk short message identification method by combining word vector with machine learning
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111782804B (en) Text CNN-based co-distributed text data selection method, system and storage medium
CN108595426B (en) Word vector optimization method based on Chinese character font structural information
Dong et al. Cross-media similarity evaluation for web image retrieval in the wild
Aziguli et al. A robust text classifier based on denoising deep neural network in the analysis of big data
CN112884053B (en) Website classification method, system, equipment and medium based on image-text mixed characteristics
Schmitt et al. Outlier detection on semantic space for sentiment analysis with convolutional neural networks
US11989526B2 (en) Systems and methods for short text similarity based clustering
CN109446321A (en) Text classification method, text classification device, terminal and computer readable storage medium
CN115640376A (en) Text labeling method and device, electronic equipment and computer-readable storage medium
Jirathampradub et al. A 3D-CNN siamese network for motion gesture Sign Language alphabets recognition
Chen et al. Class-aware convolution and attentive aggregation for image classification
CN113962221A (en) Text abstract extraction method and device, terminal equipment and storage medium
Hong et al. Deep cross-modal hashing retrieval based on semantics preserving and vision transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zhang Leping

Inventor after: Wu Yichao

Inventor after: Gu Mingjuan

Inventor after: Bian Hao

Inventor before: Zhang Leping

Inventor before: Gu Mingjuan

Inventor before: Wu Yichao

Inventor before: Bian Hao

GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 608, 6th Floor, No. 66 Jingcheng Haoyuan, Zhonglou District, Changzhou City, Jiangsu Province, 213003

Patentee after: Changzhou Jiangsuan Tiancheng Information Technology Co.,Ltd.

Country or region after: China

Address before: Room 608, 6th Floor, No. 66 Jingcheng Haoyuan, Zhonglou District, Changzhou City, Jiangsu Province, 213000

Patentee before: Jiangsu Jiangsuan Tiancheng Information Technology Co.,Ltd.

Country or region before: China