WO2019218514A1 - Method, device, and storage medium for extracting webpage target information - Google Patents
Method, device, and storage medium for extracting webpage target information
- Publication number
- WO2019218514A1 PCT/CN2018/102115
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- webpage
- target
- category
- topic
- classification
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present application relates to the field of data processing technologies, and in particular, to a method for extracting webpage target information, an electronic device, and a computer readable storage medium.
- the present application provides a method for extracting webpage target information, a server, and a computer readable storage medium, the main purpose of which is to improve the accuracy of extracting target information from a target webpage.
- the present application provides a method for extracting webpage target information, including:
- a word segmentation step: receiving a request for extracting target information from a target webpage, obtaining the webpage source code of the target webpage, and performing word segmentation on the obtained webpage source code to obtain a set of available words of the target webpage;
- a topic classification step: calculating a word vector of the target webpage according to the available word set of the target webpage, inputting the calculated word vector into a predetermined classification model corresponding to each topic category, and identifying the topic category to which the target webpage belongs;
- a location prediction step: determining a first tag corresponding to the target information, inputting the webpage source code of the target webpage into the location prediction model corresponding to the first tag in the identified topic category, and predicting a list of location information for the different locations at which the target information may appear;
- an information extraction step: selecting a preset number of locations with the highest probability from the location information list, and extracting information from the selected locations as the target information.
- the present application further provides an electronic device, including a memory and a processor, where the memory stores an extraction program of webpage target information that is executable on the processor, and the extraction program, when executed by the processor, implements the following steps:
- a word segmentation step: receiving a request for extracting target information from a target webpage, obtaining the webpage source code of the target webpage, and performing word segmentation on the obtained webpage source code to obtain a set of available words of the target webpage;
- a topic classification step: calculating a word vector of the target webpage according to the available word set of the target webpage, inputting the calculated word vector into a predetermined classification model corresponding to each topic category, and identifying the topic category to which the target webpage belongs;
- a location prediction step: determining a first tag corresponding to the target information, inputting the webpage source code of the target webpage into the location prediction model corresponding to the first tag in the identified topic category, and predicting a list of location information for the different locations at which the target information may appear;
- an information extraction step: selecting a preset number of locations with the highest probability from the location information list, and extracting information from the selected locations as the target information.
- the present application further provides a computer readable storage medium, where the computer readable storage medium includes an extraction program of webpage target information, and when the extraction program is executed by a processor, any step of the method for extracting webpage target information described above is implemented.
- the method for extracting webpage target information, the electronic device, and the computer readable storage medium proposed by the present application construct different classification models for webpages of different topic categories and classify the target webpage with those models, which improves the accuracy of topic classification of the target webpage;
- different location prediction models are constructed for the different information categories under each topic category, and the location prediction model corresponding to the relevant information category and topic category is used to predict the location information list of the locations where the target information is located in the target webpage, which improves the accuracy of predicting the location of the target information;
- the locations in the location information list whose probability ranks highest and is greater than the probability threshold are selected, and the information at those locations is extracted as the target information, which improves the accuracy of target information extraction.
- FIG. 1 is a flow chart of a preferred embodiment of a method for extracting webpage target information according to the present application
- FIG. 2 is a schematic diagram of a preferred embodiment of an electronic device of the present application.
- FIG. 3 is a schematic diagram of a program module of the extraction procedure of the webpage target information in FIG.
- the application provides a method for extracting webpage target information.
- Referring to FIG. 1, it is a flowchart of a preferred embodiment of the method for extracting webpage target information of the present application.
- the method can be performed by a device that can be implemented by software and/or hardware.
- the method for extracting webpage target information includes steps S1-S4:
- S1: receive a request for extracting target information from a target webpage, obtain the webpage source code of the target webpage, and perform word segmentation on the obtained webpage source code to obtain a set of available words of the target webpage;
- the information extraction request carries the target webpage information and the target information to be extracted, and the label corresponding to the target information is determined according to the target information to be extracted.
- a crawler tool is used to crawl the source code of the target webpage, and word segmentation is then performed on that webpage source code.
- the original data of the webpage source code of the target webpage is extracted, and irrelevant data in the original data, for example JavaScript code, CSS style code, and HTML tag data, is removed by using regular expressions.
- the retained data is segmented by a word segmentation tool, generating a set of initial words separated by spaces.
- stop words are removed from the initial word set to determine the available word set, and the available word set is used to characterize the content of the target webpage.
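The preprocessing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `extract_available_words`, the tiny `STOP_WORDS` set, and the regular expressions are all hypothetical, and a plain regular-expression tokenizer stands in for the unspecified word segmentation tool (Chinese text would need a dedicated segmenter).

```python
import re

# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}


def extract_available_words(html: str) -> list[str]:
    """Strip script/style/tag data with regular expressions, then tokenize."""
    # Remove JavaScript and CSS blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", html)
    # Remove the remaining HTML tags, keeping the text between them.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Stand-in for a word segmentation tool: lowercase alphanumeric runs.
    initial_words = re.findall(r"[a-z0-9]+", text.lower())
    # Stop-word removal yields the available word set.
    return [word for word in initial_words if word not in STOP_WORDS]
```

Regular expressions are used here only because the text names them; an HTML parser would usually be more robust against malformed markup.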
- the term frequency-inverse document frequency (TF-IDF) algorithm is used to calculate the importance of each word in the available word set of the target webpage, and the words in the available word set are sorted in descending order of importance.
- the top N words in the available word set of the target webpage are selected as the keywords of the target webpage, where N>0 and N is an integer.
- a Chinese word vector model (Word2vec model) is generated based on the Chinese Wikipedia corpus, and the word vectors of the N keywords of the target webpage are each calculated by the Word2vec model.
- the word vector of the target webpage is then calculated from the word vectors of the N keywords.
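As a rough sketch of the keyword selection and page-vector calculation above: the code below ranks words by a smoothed TF-IDF score and averages the keyword vectors. The function names, the smoothing term in the IDF, and the use of a plain average are assumptions; the text does not say how the keyword vectors are combined, and `embeddings` is a hypothetical dictionary standing in for a trained Word2vec model.

```python
import math
from collections import Counter


def top_n_keywords(doc, corpus, n):
    """Rank the words of `doc` by TF-IDF against `corpus` (a list of token lists)."""
    term_counts = Counter(doc)
    scores = {}
    for word, count in term_counts.items():
        doc_freq = sum(1 for other in corpus if word in other)
        # Smoothed IDF (an assumption; the text does not give the exact formula).
        idf = math.log(len(corpus) / (1 + doc_freq)) + 1.0
        scores[word] = (count / len(doc)) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]


def page_vector(keywords, embeddings):
    """Average the keyword vectors; averaging is an assumption, not from the text."""
    vectors = [embeddings[w] for w in keywords if w in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```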
- the word vector of the target webpage is sequentially input into the pre-trained classification models corresponding to the different topic categories, for example, the classification model corresponding to the tourism category, the classification model corresponding to the economic category, and the classification model corresponding to the sports category.
- the output of the classification model corresponding to each topic category indicates the probability that the target webpage belongs to that topic category. Therefore, among the outputs of the classification models corresponding to the different topic categories, the topic category with the maximum probability is selected as the topic category to which the target webpage belongs.
- a threshold may be preset (for example, 0.5): the maximum probability among the outputs of the classification models is selected and compared with the preset threshold.
- when the maximum probability is greater than or equal to the preset threshold, the topic category corresponding to the maximum probability is used as the topic category to which the target webpage belongs.
- when the maximum probability is less than the preset threshold, a classification instruction issued by the user for the topic category to which the target webpage belongs is received, and the topic category to which the target webpage belongs is determined according to the topic category included in the classification instruction.
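The selection rule above can be sketched as follows. This is an illustrative sketch only: `classify_topic` and the `models` mapping are hypothetical, and each model is assumed to be a callable that returns the probability that the page belongs to its topic category.

```python
def classify_topic(page_vec, models, threshold=0.5):
    """Return the most probable topic, or None when no model is confident enough.

    `models` maps a topic category name to a callable returning a probability
    (a stand-in for the pre-trained per-category classification models).
    """
    probabilities = {topic: model(page_vec) for topic, model in models.items()}
    best_topic = max(probabilities, key=probabilities.get)
    if probabilities[best_topic] >= threshold:
        return best_topic
    # Below the threshold: the caller should fall back to a user-supplied
    # classification instruction, as described in the text.
    return None
```

Returning `None` rather than raising keeps the manual-classification fallback in the caller's hands.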
- the training steps of the predetermined classification model include:
- obtaining the source code of each specified webpage, segmenting it into words to obtain a set of available words for each specified webpage, extracting keywords from the set of available words, and generating a word vector for each specified webpage;
- dividing the sample data in the set into a training set and a verification set, training a neural network model with the training set, verifying the neural network model with the verification set, and, when the verification result satisfies the first preset condition, determining the classification models corresponding to the different topic categories.
- the different second tags represent different subject categories to which the web page belongs, such as travel, economy, sports, politics, and entertainment.
- the word vectors of the webpages of each topic category are taken as the positive samples corresponding to that topic category.
- negative samples also need to be constructed before the model is trained.
- for a given topic category, the word vectors of webpages whose second label is that category are positive samples, and the word vectors of webpages whose second label is any other category are negative samples.
- Different subject categories correspond to different classification models, which improves the accuracy of web page topic classification, and lays a good foundation for predicting the location of target information and extracting target information from the target web page.
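The positive and negative sample construction described above amounts to a one-versus-rest labelling. A minimal sketch, assuming each training page is represented as a `(word_vector, second_label)` pair; the function name and the 1/0 encoding are hypothetical:

```python
def build_sample_sets(labelled_vectors):
    """Build one binary sample set per topic category (one-versus-rest).

    `labelled_vectors` is a list of (word_vector, second_label) pairs.
    For each category, pages carrying that label become positive samples (1)
    and pages carrying any other label become negative samples (0).
    """
    categories = sorted({label for _, label in labelled_vectors})
    return {
        category: [
            (vector, 1 if label == category else 0)
            for vector, label in labelled_vectors
        ]
        for category in categories
    }
```

Each resulting sample set could then be used to train the classification model for its own topic category.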
- the first tag represents the category of the target information to be extracted.
- for example, the first labels of a travel webpage include: number of days, time, per capita cost, companions, and so on.
- different first tags of the same topic category correspond to different location prediction models. Therefore, after the topic category to which the target webpage belongs is determined according to the above steps, the model file of the location prediction model corresponding to the first label in that topic category is invoked, and the webpage source code of the target webpage is input into the location prediction model.
- the model output is a list of location information for the different locations in the webpage source code of the target webpage at which the target information may appear, together with the probability that the target information appears at each location.
- the training steps of the position prediction model include:
- marking the different first tags in the source code of each specified webpage, and dividing the source code of the webpages in each set into subsets corresponding to the first tags, as the sample data corresponding to the different first tags in each topic category;
- dividing the sample data in each subset into a training set and a verification set, training a recurrent neural network model with the training set, and verifying the recurrent neural network model with the verification set;
- when the verification result satisfies the second preset condition, determining the position prediction models corresponding to the different first labels under each topic category.
- web pages of the same topic category have a similar web page structure, consisting of labels (i.e., first labels) and attribute data.
- the first labels of a travel webpage include: number of days, time, per capita cost, companions, and subject and body information
- the first labels of a political webpage include: subject, body, time, media, and related information
- the first labels include: economic policy, foreign policy, stock information, real estate policy or national policy
- the first labels of a sports webpage include: star data, team competitions, match time, game scores, etc.
- labels include: stars, events, time, etc.
- the webpage source code of the specified webpages of the same topic category that is marked with the same first label is used as the sample data of the position prediction model corresponding to that first label in the topic category.
- it should be noted that, since the webpage source code of one webpage contains different first tags, the webpage source code of the same webpage may appear in the sample data corresponding to different first tags at the same time. In addition, the sample data includes both positive and negative samples, which will not be described here.
- 80% of the data of the first tag in the topic category is extracted as a training set, and 20% of the data is used as a verification set.
- the training set is used to train the recurrent neural network model to construct a position prediction model, the trained position prediction model is tuned, and the tuned position prediction model is verified with the verification set until the second preset condition is met (for example, the accuracy is greater than or equal to 95%).
- the above steps are repeated to determine a position prediction model corresponding to each of the first labels in each subject category.
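The 80/20 split and the accuracy check described above can be sketched as follows. The recurrent network itself is out of scope here: `predict` is a hypothetical callable standing in for a trained position prediction model, and the function names, the fixed random seed, and the `(input, expected)` sample shape are assumptions.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle deterministically and split into (training set, verification set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def meets_condition(predict, verification_set, min_accuracy=0.95):
    """Check the 'second preset condition': accuracy on the verification set."""
    if not verification_set:
        return False
    correct = sum(1 for x, expected in verification_set if predict(x) == expected)
    return correct / len(verification_set) >= min_accuracy
```

In a training loop, tuning would continue until `meets_condition` returns `True` for the model being trained.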
- Different topic categories and different first tags correspond to different location prediction models, which improves the accuracy of location prediction and lays a good foundation for subsequent extraction of target information from target web pages.
- the foregoing location information list is obtained, the probability that the target information appears at each location is read from the list, the locations are sorted by probability, a preset number (for example, three) of the top-ranked locations are selected as the locations where the target information is located, and the information at those locations is extracted as the target information.
- a location probability threshold may also be preset: the probability that the target information appears at each location is read from the location information list, and those of the preset number (for example, three) of top-ranked locations whose probability is greater than or equal to the location probability threshold are taken as the locations where the target information is located, and the information at those locations is extracted as the target information.
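The two selection variants above (top-ranked locations, optionally filtered by a probability threshold) can be sketched as follows. The function name and the representation of the location information list as `(location, probability)` pairs are assumptions; the text does not specify how a location is encoded.

```python
def select_locations(location_list, top_k=3, min_probability=None):
    """Pick the top_k most probable locations, optionally filtered by a threshold.

    `location_list` is a list of (location, probability) pairs, a hypothetical
    representation of the location information list.
    """
    ranked = sorted(location_list, key=lambda item: item[1], reverse=True)[:top_k]
    if min_probability is not None:
        # Keep only top-ranked locations that also reach the probability threshold.
        ranked = [item for item in ranked if item[1] >= min_probability]
    return [location for location, _ in ranked]
```

For instance, with candidates `[("a", 0.1), ("b", 0.7), ("c", 0.4), ("d", 0.9)]`, the default call keeps `d`, `b`, and `c`, while adding `min_probability=0.5` keeps only `d` and `b`.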
- the method for extracting webpage target information proposed by the foregoing embodiment constructs different classification models for webpages of different topic categories and classifies the target webpage with those models, which improves the accuracy of topic classification of the target webpage;
- different location prediction models are constructed for the different information categories under each topic category, and the position prediction model corresponding to the relevant information category and topic category is used to predict the location information list of the locations where the target information is located in the target webpage, which improves the accuracy of predicting the location of the target information;
- the locations in the location information list whose probability ranks highest and is greater than the probability threshold are selected, and the information at those locations is extracted as the target information, which improves the accuracy of target information extraction.
- step S2 may be replaced by:
- the topic category with the highest similarity is used as the topic category to which the target webpage belongs;
- the classification instruction for the topic category to which the target webpage belongs is received, and the topic category included in the classification instruction is used as the topic category to which the target webpage belongs.
- the word vector of the predetermined subject categories is obtained by the following steps:
- the source code of the specified webpages under each topic category is obtained, and the source code of each webpage is segmented into words to obtain the available word set of each webpage.
- using the TF-IDF algorithm, the importance of each word in the available word set of each webpage is calculated, and for each webpage the N words with the highest importance are selected as the keywords of that webpage.
- the word vectors of the selected N keywords are calculated by the Word2vec model, and the word vector of the webpage is calculated from the word vectors of the keywords.
- the word vectors of all web pages are calculated in this way.
- the keywords of all the webpages in each topic category are collected, and the word frequency of each keyword across the webpages in that topic category is counted; the word frequency reflects the weight of the keyword.
- the M keywords with the highest word frequency are selected as the keywords of each topic category, the word vectors of these keywords are calculated by the Word2vec model, and the word vector of the topic category is calculated according to the word vectors of the keywords and their word frequencies.
- the word vector of the subject category is used as the cluster center corresponding to each topic category.
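One way to read the calculation of a topic category's word vector is as a frequency-weighted mean of its top M keyword vectors. The sketch below takes that reading as an assumption; the text only says the vector is calculated "according to" the keyword vectors and word frequencies. `topic_center` and the `embeddings` dictionary (a stand-in for a Word2vec model) are hypothetical.

```python
from collections import Counter


def topic_center(pages_keywords, embeddings, m):
    """Frequency-weighted mean of the top-m keyword vectors of one topic category.

    `pages_keywords` holds the keyword list of every page in the category.
    """
    frequency = Counter(word for keywords in pages_keywords for word in keywords)
    # Keep the m most frequent keywords that have a known vector.
    top = [(w, c) for w, c in frequency.most_common() if w in embeddings][:m]
    total = sum(count for _, count in top)
    if total == 0:
        return []
    dim = len(embeddings[top[0][0]])
    return [
        sum(embeddings[word][i] * count for word, count in top) / total
        for i in range(dim)
    ]
```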
- the similarity between the word vector of the target webpage and the word vector of each topic category is calculated by the cosine similarity formula, and the topic category whose word vector has the greatest similarity to the word vector of the target webpage is selected. It can be understood that the higher the similarity, the higher the accuracy of the topic classification of the target webpage.
- a similarity threshold is preset. When the maximum similarity is greater than or equal to the similarity threshold, the topic category corresponding to the maximum similarity is used as the topic category to which the target webpage belongs.
- when the maximum similarity is less than the similarity threshold, a classification instruction for the topic category to which the target webpage belongs is received, and the topic category included in the classification instruction is used as the topic category to which the target webpage belongs.
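The similarity-based alternative above can be sketched as a nearest-center lookup. Cosine similarity is the standard formula (dot product divided by the product of the vector norms); the function names and the `centers` mapping from topic category to its word vector are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_topic(page_vec, centers, threshold):
    """Return the topic whose center is most similar, or None below the threshold."""
    similarities = {
        topic: cosine_similarity(page_vec, center)
        for topic, center in centers.items()
    }
    best = max(similarities, key=similarities.get)
    # Below the threshold, the caller falls back to a classification instruction.
    return best if similarities[best] >= threshold else None
```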
- the method for extracting webpage target information proposed by the foregoing embodiment uses a clustering method to predetermine a cluster center (word vector) corresponding to each topic category, calculates the similarity between the word vector of the target webpage and the cluster center of each predetermined topic category, and selects the topic category whose maximum similarity satisfies the preset condition as the topic category to which the target webpage belongs, so that the webpage topic classification is more accurate.
- the application also provides an electronic device.
- Referring to FIG. 2, it is a schematic diagram of a preferred embodiment of the electronic device 1 of the present application.
- the electronic device 1 may be a terminal device with a data processing function, such as a server, a smart phone, a tablet computer, a portable computer, a desktop computer, etc.
- the server may be a rack server, a blade server, or a tower server.
- the electronic device 1 includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
- the memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (for example, an SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
- the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1, in some embodiments.
- the memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in hard disk equipped on the electronic device 1, a smart memory card (SMC), a Secure Digital (SD) card, a flash card, etc. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
- the memory 11 can be used not only for storing application software and various types of data installed in the electronic device 1, such as the extraction program 10 of the web page target information, but also for temporarily storing data that has been output or is to be output.
- the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used for running program code stored in the memory 11 or processing data, for example executing the extraction program 10 of webpage target information.
- Communication bus 13 is used to implement connection communication between these components.
- the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
- FIG. 2 shows only the electronic device 1 having the components 11-14. It will be understood by those skilled in the art that the structure shown in FIG. 2 does not constitute a limitation on the electronic device 1, which may include fewer or more components than illustrated, a combination of certain components, or a different arrangement of components.
- the electronic device 1 may further include a user interface
- the user interface may include a display, an input unit such as a keyboard, and the optional user interface may further include a standard wired interface and a wireless interface.
- the display may be an LED display, a liquid crystal display, a touch liquid crystal display, and an Organic Light-Emitting Diode (OLED) touch device.
- the display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
- the memory 11, as a computer storage medium, stores the program code of the extraction program 10 of webpage target information; when the processor 12 executes this program code, the following steps are implemented:
- a word segmentation step: receiving a request for extracting target information from the target webpage, obtaining the webpage source code of the target webpage, and performing word segmentation on the obtained webpage source code to obtain a set of available words of the target webpage.
- the information extraction request carries the target webpage information and the target information to be extracted, and the label corresponding to the target information is determined according to the target information to be extracted.
- a crawler tool is used to crawl the source code of the target webpage, and word segmentation is then performed on that webpage source code.
- the original data of the webpage source code of the target webpage is extracted, and irrelevant data in the original data, for example JavaScript code, CSS style code, and HTML tag data, is removed by using regular expressions.
- the retained data is segmented by a word segmentation tool, generating a set of initial words separated by spaces.
- stop words are removed from the initial word set to determine the available word set, and the available word set is used to characterize the content of the target webpage.
- a topic classification step: calculating a word vector of the target webpage according to the available word set of the target webpage, inputting the calculated word vector into a predetermined classification model corresponding to each topic category, and identifying the topic category to which the target webpage belongs.
- the term frequency-inverse document frequency (TF-IDF) algorithm is used to calculate the importance of each word in the available word set of the target webpage, and the words in the available word set are sorted in descending order of importance.
- the top N words in the available word set of the target webpage are selected as the keywords of the target webpage, where N>0 and N is an integer.
- a Chinese word vector model (Word2vec model) is generated based on the Chinese Wikipedia corpus, and the word vectors of the N keywords of the target webpage are each calculated by the Word2vec model.
- the word vector of the target webpage is then calculated from the word vectors of the N keywords.
- the word vector of the target webpage is sequentially input into the pre-trained classification models corresponding to the different topic categories, for example, the classification model corresponding to the tourism category, the classification model corresponding to the economic category, and the classification model corresponding to the sports category.
- the output of the classification model corresponding to each topic category indicates the probability that the target webpage belongs to that topic category. Therefore, among the outputs of the classification models corresponding to the different topic categories, the topic category with the maximum probability is selected as the topic category to which the target webpage belongs.
- a threshold may be preset (for example, 0.5): the maximum probability among the outputs of the classification models is selected and compared with the preset threshold.
- when the maximum probability is greater than or equal to the preset threshold, the topic category corresponding to the maximum probability is used as the topic category to which the target webpage belongs.
- when the maximum probability is less than the preset threshold, a classification instruction issued by the user for the topic category to which the target webpage belongs is received, and the topic category to which the target webpage belongs is determined according to the topic category included in the classification instruction.
- the training steps of the predetermined classification model include:
- a second label is marked for the predetermined webpage according to the topic category to which the webpage belongs.
- the different second tags represent different subject categories to which the web page belongs, such as travel, economy, sports, politics, and entertainment.
- the word vectors of the webpages of each topic category are taken as the positive samples corresponding to that topic category. In order to ensure the accuracy of the classification model, negative samples also need to be constructed before the model is trained.
- for a given topic category, the word vectors of webpages whose second label is that category are positive samples, and the word vectors of webpages whose second label is any other category are negative samples.
- the sample sets corresponding to the different topic categories are denoted [X, Y], where X is the word vector of a webpage of a certain topic category, and Y is the topic category corresponding to that word vector.
- Different subject categories correspond to different classification models, which improves the accuracy of web page topic classification, and lays a good foundation for predicting the location of target information and extracting target information from the target web page.
- a location prediction step: determining a first tag corresponding to the target information, inputting the webpage source code of the target webpage into the location prediction model corresponding to the first tag in the identified topic category, and predicting a list of location information for the different locations at which the target information may appear.
- the first tag represents the category of the target information to be extracted.
- for example, the first labels of a travel webpage include: number of days, time, per capita cost, companions, and so on.
- different first tags of the same topic category correspond to different location prediction models. Therefore, after the topic category to which the target webpage belongs is determined according to the above steps, the model file of the location prediction model corresponding to the first label in that topic category is invoked, and the webpage source code of the target webpage is input into the location prediction model.
- the model output is a list of location information for the different locations in the webpage source code of the target webpage at which the target information may appear, together with the probability that the target information appears at each location.
- the training steps of the position prediction model include:
- marking the different first tags in the source code of each specified webpage, and dividing the source code of the webpages in each set into subsets corresponding to the first tags, as the sample data corresponding to the different first tags in each topic category;
- dividing the sample data in each subset into a training set and a verification set, training a recurrent neural network model with the training set, and verifying the recurrent neural network model with the verification set;
- when the verification result satisfies the second preset condition, determining the position prediction models corresponding to the different first labels under each topic category.
- web pages of the same topic category have a similar web page structure: a tag (i.e., a first tag) and attribute data.
- the first tags of a travel webpage include: number of days, time, per capita fee, companion, and subject and body information
- the first tags of a political webpage include: subject, body, time, media, and related information
- the first tags include: economic policy, foreign policy, stock information, real estate policy or national policy
- the first tags of a sports webpage include: star data, team competitions, match time and game scores, etc.
- the first tags include: stars, events, time, etc.
- the webpage source code of the specified webpages of the same topic category is marked with the same first tags, namely the first tags of that topic category, and serves as the sample data of the position prediction model. It should be noted that, since the webpage source code of one webpage contains several different first tags, the source code of the same webpage may appear in the sample data of several first tags at the same time. In addition, the sample data includes both positive and negative samples, which will not be described here.
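The grouping of annotated pages into per-tag subsets can be sketched as below. This is an illustrative sketch only: the `AnnotatedPage` tuple layout and the function name `build_subsets` are assumptions, and negative samples are left out for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical annotated page:
# (page id, topic category, first tags marked in its source code).
AnnotatedPage = Tuple[str, str, Set[str]]


def build_subsets(pages: Iterable[AnnotatedPage]) -> Dict[Tuple[str, str], List[str]]:
    """Group page ids into one subset per (topic category, first tag)."""
    subsets: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for page_id, topic, first_tags in pages:
        for tag in sorted(first_tags):
            # A page marked with several first tags joins several subsets.
            subsets[(topic, tag)].append(page_id)
    return dict(subsets)


pages = [
    ("page-1", "travel", {"time", "companion"}),
    ("page-2", "travel", {"time"}),
    ("page-3", "sports", {"match time"}),
]
subsets = build_subsets(pages)
```

Here `page-1` lands in both the `("travel", "time")` and `("travel", "companion")` subsets, which is the overlap the note above describes.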
- 80% of the data of the first tag in the subject category is extracted as a training set, and 20% of the data is used as a verification set.
- the training set is used to train the recurrent neural network model to construct a position prediction model, the trained position prediction model is tuned, and the tuned position prediction model is verified with the verification set until the second preset condition is met (for example, the accuracy is greater than or equal to 95%).
- the above steps are repeated to determine a position prediction model corresponding to each of the first labels in each subject category.
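The 80%/20% split and the train-until-accepted loop can be sketched as follows. The model itself is abstracted away: `train_round` and `evaluate` are hypothetical callables standing in for one training pass of the recurrent neural network and for its accuracy on the verification set, and the shuffle seed and `max_rounds` cap are assumptions added to keep the sketch deterministic and terminating.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], train_ratio: float = 0.8,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them into a training and a verification set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def train_until_accepted(train_round: Callable[[List[T]], None],
                         evaluate: Callable[[List[T]], float],
                         train_set: List[T],
                         verification_set: List[T],
                         min_accuracy: float = 0.95,
                         max_rounds: int = 50) -> Tuple[int, float]:
    """Repeat training and verification until the accuracy condition is met."""
    for round_index in range(1, max_rounds + 1):
        train_round(train_set)                  # one training/tuning pass
        accuracy = evaluate(verification_set)   # accuracy on held-out data
        if accuracy >= min_accuracy:
            return round_index, accuracy
    raise RuntimeError(f"accuracy stayed below {min_accuracy} after {max_rounds} rounds")
```

The same two helpers would be called once per (topic category, first tag) subset, matching the statement that the steps are repeated for every first tag in every topic category.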
- Different topic categories and different first tags correspond to different location prediction models, which improves the accuracy of location prediction and lays a good foundation for subsequent extraction of target information from target web pages.
- an information extraction step: selecting a preset number of locations with the highest probability from the location information list, and extracting information from the selected locations as the target information.
- the foregoing location information list is obtained, the probability that the target information appears at each location is read from the list, the locations are sorted by probability, the preset number (for example, three) of top-ranked locations are selected as the locations of the target information, and the information at those locations is extracted as the target information.
- a location probability threshold may also be preset: the probability that the target information appears at each location is read from the location information list, the preset number (for example, three) of top-ranked locations whose probability is greater than or equal to the location probability threshold are taken as the locations of the target information, and the information at those locations is extracted as the target information.
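The selection rule for both variants (top-ranked locations only, or top-ranked locations that also clear a probability threshold) can be sketched as below. The function name, the parameter names, and the representation of a location as an (offset, probability) pair are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed representation: (character offset in the page source, probability).
Location = Tuple[int, float]


def select_locations(location_list: List[Location],
                     preset_number: int = 3,
                     probability_threshold: Optional[float] = None) -> List[Location]:
    """Keep the top-ranked locations, optionally dropping those below a threshold."""
    # Sort by probability, highest first.
    ranked = sorted(location_list, key=lambda item: item[1], reverse=True)
    top = ranked[:preset_number]
    if probability_threshold is not None:
        top = [item for item in top if item[1] >= probability_threshold]
    return top


candidates = [(10, 0.2), (40, 0.7), (25, 0.05), (90, 0.6)]
top_three = select_locations(candidates)
confident = select_locations(candidates, probability_threshold=0.5)
```

Because the list is sorted before truncation, applying the threshold after taking the top entries gives the same result as filtering first and then taking the top entries.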
- the electronic device 1 proposed in the above embodiment constructs different classification models for web pages of different topic categories and classifies the target webpage using the classification models corresponding to the different topic categories, which improves the accuracy of topic classification of the target webpage. It constructs different position prediction models for the different information categories under the different topic categories and uses the position prediction model corresponding to the relevant information category to predict the list of location information for the target information in the target webpage, which improves the accuracy of the predicted locations. It then selects from the location information list the top-ranked positions whose probability is greater than the probability threshold and extracts information from those positions as the target information, which improves the accuracy of target information extraction.
- the extraction program 10 of the webpage target information may also be divided into one or more modules, the one or more modules being stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to accomplish the present application.
- a module referred to herein refers to a series of computer program instructions that are capable of performing a particular function.
- referring to FIG. 3, it is a block diagram of the extraction program 10 of the webpage target information in FIG. 2.
- the webpage target information extraction program 10 can be divided into a word segmentation module 110, a topic classification module 120, a position prediction module 130, and an information extraction module 140. The functions or operation steps implemented by the modules 110-140 are similar to those described above and are not described in detail here; for example:
- the word segmentation module 110 is configured to receive a request for extracting target information from the target webpage, obtain a webpage source code of the target webpage, and perform word segmentation processing on the obtained webpage source code to obtain a set of available words of the target webpage;
- the topic classification module 120 is configured to calculate a word vector of the target webpage according to the available word set of the target webpage, input the calculated word vector into a predetermined classification model corresponding to each topic category, and identify the topic category to which the target webpage belongs;
- the location prediction module 130 is configured to determine a first tag corresponding to the target information, input the webpage source code of the target webpage into the location prediction model corresponding to the first tag in the identified topic category, and predict a list of location information for the different locations at which the target information appears;
- the information extraction module 140 is configured to filter a preset number of locations with the highest probability from the location information list, and extract information from the filtered location as the target information.
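One way to picture how modules 110-140 hand data to one another is the sketch below. It is only an assumption about the wiring: the class name `ExtractionProgram`, the field names, and the toy stand-in functions are hypothetical, and the real modules would wrap a word segmenter and trained models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Location = Tuple[int, float]  # assumed: (offset in page source, probability)


@dataclass
class ExtractionProgram:
    segment: Callable[[str], List[str]]                  # word segmentation module 110
    classify: Callable[[List[str]], str]                 # topic classification module 120
    predict: Callable[[str, str, str], List[Location]]   # position prediction module 130
    extract: Callable[[str, List[Location]], List[str]]  # information extraction module 140

    def run(self, page_source: str, first_tag: str) -> List[str]:
        words = self.segment(page_source)
        topic = self.classify(words)
        locations = self.predict(topic, first_tag, page_source)
        return self.extract(page_source, locations)


# Toy stand-ins so the sketch runs end to end.
def toy_segment(source: str) -> List[str]:
    return source.split()

def toy_classify(words: List[str]) -> str:
    return "travel" if "trip" in words else "other"

def toy_predict(topic: str, first_tag: str, source: str) -> List[Location]:
    return [(source.index("3"), 0.9)] if topic == "travel" else []

def toy_extract(source: str, locations: List[Location]) -> List[str]:
    return [source[pos:pos + 6] for pos, _ in locations]


program = ExtractionProgram(toy_segment, toy_classify, toy_predict, toy_extract)
result = program.run("trip lasted 3 days total", "number of days")
```

Passing the four modules in as callables keeps each one replaceable, which matches the description of modules as independent series of program instructions.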
- the embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium includes an extraction program 10 of webpage target information, and the extraction program 10 of the webpage target information is executed by a processor to implement the following operations:
- a word segmentation step receiving a request for extracting target information from a target webpage, obtaining a webpage source code of the target webpage, and performing word segmentation on the obtained webpage source code to obtain a set of available words of the target webpage;
- a topic classification step: calculating a word vector of the target webpage according to the available word set of the target webpage, inputting the calculated word vector into a predetermined classification model corresponding to each topic category, and identifying the topic category to which the target webpage belongs;
- a location prediction step: determining a first tag corresponding to the target information, inputting the webpage source code of the target webpage into the location prediction model corresponding to the first tag in the identified topic category, and predicting a list of location information for the different locations at which the target information appears;
- an information extraction step: selecting a preset number of locations with the highest probability from the location information list, and extracting information from the selected locations as the target information.
- the specific implementation manner of the computer readable storage medium of the present application is substantially the same as the specific implementation manner of the method for extracting the webpage target information, and details are not described herein again.
- the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (such as the ROM/RAM, magnetic disk, or optical disk described above), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present application.
Claims (20)
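- A method for extracting webpage target information, applied to an electronic device, characterized in that the method comprises: a word segmentation step: receiving a request for extracting target information from a target webpage, obtaining the webpage source code of the target webpage, and performing word segmentation on the obtained webpage source code to obtain a set of available words of the target webpage; a topic classification step: calculating a word vector of the target webpage according to the set of available words of the target webpage, inputting the calculated word vector into a predetermined classification model corresponding to each topic category, and identifying the topic category to which the target webpage belongs; a location prediction step: determining a first tag corresponding to the target information, inputting the webpage source code of the target webpage into the location prediction model corresponding to the first tag in the identified topic category, and predicting a list of location information for the different locations at which the target information appears; and an information extraction step: selecting a preset number of locations with the highest probability from the location information list, and extracting information from the selected locations as the target information.
- The method for extracting webpage target information according to claim 1, characterized in that the step of "identifying the topic category to which the target webpage belongs" comprises: selecting the topic category corresponding to the highest probability value in the output of the classification model as the topic category to which the target webpage belongs.
- The method for extracting webpage target information according to claim 2, characterized in that the topic classification step may be replaced by: separately calculating the similarity between the word vector of the target webpage and the predetermined word vector of each topic category, and, when the maximum similarity is greater than or equal to a preset similarity threshold, taking the topic category with the highest similarity as the topic category to which the target webpage belongs; and, when the maximum similarity is less than the preset similarity threshold, receiving a classification instruction for the topic category to which the target webpage belongs, and taking the topic category contained in the classification instruction as the topic category to which the target webpage belongs.
- The method for extracting webpage target information according to claim 1, characterized in that the training steps of the classification model comprise: obtaining the webpage source code of specified webpages, performing word segmentation on the webpage source code of each specified webpage to obtain a set of available words of each specified webpage, extracting keywords from the set of available words, and generating a word vector of each specified webpage; marking each specified webpage with a second tag, and dividing the word vectors into sets corresponding to the different second tags as sample data of the different topic categories; and dividing the sample data in each set into a training set and a verification set, training a neural network model with the training set, verifying the neural network model with the verification set, and, when the verification result satisfies a first preset condition, determining the classification models corresponding to the different topic categories.
- The method for extracting webpage target information according to claim 4, characterized in that the training steps of the location prediction model comprise: marking each specified webpage with the second tag, and dividing the webpage source code of the specified webpages into sets corresponding to the different topic categories according to the second tag; marking different first tags in the webpage source code of each specified webpage, and dividing the webpage source code in each set into subsets corresponding to the respective first tags, as sample data corresponding to the different first tags under each topic category; and dividing the sample data in each subset into a training set and a verification set, training a recurrent neural network model with the training set, verifying the recurrent neural network model with the verification set, and, when the verification result satisfies a second preset condition, determining the location prediction models corresponding to the different first tags under each topic category.
- The method for extracting webpage target information according to claim 5, characterized in that the step of "identifying the topic category to which the target webpage belongs" comprises: selecting the topic category corresponding to the highest probability value in the output of the classification model as the topic category to which the target webpage belongs.
- The method for extracting webpage target information according to claim 6, characterized in that the topic classification step may be replaced by: separately calculating the similarity between the word vector of the target webpage and the predetermined word vector of each topic category, and, when the maximum similarity is greater than or equal to a preset similarity threshold, taking the topic category with the highest similarity as the topic category to which the target webpage belongs; and, when the maximum similarity is less than the preset similarity threshold, receiving a classification instruction for the topic category to which the target webpage belongs, and taking the topic category contained in the classification instruction as the topic category to which the target webpage belongs.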
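- An electronic device, characterized in that the device comprises a memory and a processor, the memory storing a program for extracting webpage target information that is executable on the processor, and the program for extracting webpage target information, when executed by the processor, implementing the following steps: a word segmentation step: receiving a request for extracting target information from a target webpage, obtaining the webpage source code of the target webpage, and performing word segmentation on the obtained webpage source code to obtain a set of available words of the target webpage; a topic classification step: calculating a word vector of the target webpage according to the set of available words of the target webpage, inputting the calculated word vector into a predetermined classification model corresponding to each topic category, and identifying the topic category to which the target webpage belongs; a location prediction step: determining a first tag corresponding to the target information, inputting the webpage source code of the target webpage into the location prediction model corresponding to the first tag in the identified topic category, and predicting a list of location information for the different locations at which the target information appears; and an information extraction step: selecting a preset number of locations with the highest probability from the location information list, and extracting information from the selected locations as the target information.
- The electronic device according to claim 8, characterized in that the step of "identifying the topic category to which the target webpage belongs" comprises: selecting the topic category corresponding to the highest probability value in the output of the classification model as the topic category to which the target webpage belongs.
- The electronic device according to claim 9, characterized in that the topic classification step may be replaced by: separately calculating the similarity between the word vector of the target webpage and the predetermined word vector of each topic category, and, when the maximum similarity is greater than or equal to a preset similarity threshold, taking the topic category with the highest similarity as the topic category to which the target webpage belongs; when the maximum similarity is less than the preset similarity threshold, receiving a classification instruction for the topic category to which the target webpage belongs, and taking the topic category contained in the classification instruction as the topic category to which the target webpage belongs.
- The electronic device according to claim 10, characterized in that the training steps of the classification model comprise: obtaining the webpage source code of specified webpages, performing word segmentation on the webpage source code of each specified webpage to obtain a set of available words of each specified webpage, extracting keywords from the set of available words, and generating a word vector of each specified webpage; marking each specified webpage with a second tag, and dividing the word vectors into sets corresponding to the different second tags as sample data of the different topic categories; and dividing the sample data in each set into a training set and a verification set, training a neural network model with the training set, verifying the neural network model with the verification set, and, when the verification result satisfies a first preset condition, determining the classification models corresponding to the different topic categories.
- The electronic device according to claim 11, characterized in that the training steps of the location prediction model comprise: marking each specified webpage with the second tag, and dividing the webpage source code of the specified webpages into sets corresponding to the different topic categories according to the second tag; marking different first tags in the webpage source code of each specified webpage, and dividing the webpage source code in each set into subsets corresponding to the respective first tags, as sample data corresponding to the different first tags under each topic category; and dividing the sample data in each subset into a training set and a verification set, training a recurrent neural network model with the training set, verifying the recurrent neural network model with the verification set, and, when the verification result satisfies a second preset condition, determining the location prediction models corresponding to the different first tags under each topic category.
- The electronic device according to claim 12, characterized in that the step of "identifying the topic category to which the target webpage belongs" comprises: selecting the topic category corresponding to the highest probability value in the output of the classification model as the topic category to which the target webpage belongs.
- The electronic device according to claim 13, characterized in that the topic classification step may be replaced by: separately calculating the similarity between the word vector of the target webpage and the predetermined word vector of each topic category, and, when the maximum similarity is greater than or equal to a preset similarity threshold, taking the topic category with the highest similarity as the topic category to which the target webpage belongs; and, when the maximum similarity is less than the preset similarity threshold, receiving a classification instruction for the topic category to which the target webpage belongs, and taking the topic category contained in the classification instruction as the topic category to which the target webpage belongs.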
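- A computer readable storage medium, characterized in that the computer readable storage medium includes a program for extracting webpage target information, and the program for extracting webpage target information, when executed by the processor, implements the following steps: a word segmentation step: receiving a request for extracting target information from a target webpage, obtaining the webpage source code of the target webpage, and performing word segmentation on the obtained webpage source code to obtain a set of available words of the target webpage; a topic classification step: calculating a word vector of the target webpage according to the set of available words of the target webpage, inputting the calculated word vector into a predetermined classification model corresponding to each topic category, and identifying the topic category to which the target webpage belongs; a location prediction step: determining a first tag corresponding to the target information, inputting the webpage source code of the target webpage into the location prediction model corresponding to the first tag in the identified topic category, and predicting a list of location information for the different locations at which the target information appears; and an information extraction step: selecting a preset number of locations with the highest probability from the location information list, and extracting information from the selected locations as the target information.
- The computer readable storage medium according to claim 15, characterized in that the step of "identifying the topic category to which the target webpage belongs" comprises: selecting the topic category corresponding to the highest probability value in the output of the classification model as the topic category to which the target webpage belongs.
- The computer readable storage medium according to claim 16, characterized in that the topic classification step may be replaced by: separately calculating the similarity between the word vector of the target webpage and the predetermined word vector of each topic category, and, when the maximum similarity is greater than or equal to a preset similarity threshold, taking the topic category with the highest similarity as the topic category to which the target webpage belongs; when the maximum similarity is less than the preset similarity threshold, receiving a classification instruction for the topic category to which the target webpage belongs, and taking the topic category contained in the classification instruction as the topic category to which the target webpage belongs.
- The computer readable storage medium according to claim 15, characterized in that the training steps of the classification model comprise: obtaining the webpage source code of specified webpages, performing word segmentation on the webpage source code of each specified webpage to obtain a set of available words of each specified webpage, extracting keywords from the set of available words, and generating a word vector of each specified webpage; marking each specified webpage with a second tag, and dividing the word vectors into sets corresponding to the different second tags as sample data of the different topic categories; and dividing the sample data in each set into a training set and a verification set, training a neural network model with the training set, verifying the neural network model with the verification set, and, when the verification result satisfies a first preset condition, determining the classification models corresponding to the different topic categories.
- The computer readable storage medium according to claim 18, characterized in that the training steps of the location prediction model comprise: marking each specified webpage with the second tag, and dividing the webpage source code of the specified webpages into sets corresponding to the different topic categories according to the second tag; marking different first tags in the webpage source code of each specified webpage, and dividing the webpage source code in each set into subsets corresponding to the respective first tags, as sample data corresponding to the different first tags under each topic category; and dividing the sample data in each subset into a training set and a verification set, training a recurrent neural network model with the training set, verifying the recurrent neural network model with the verification set, and, when the verification result satisfies a second preset condition, determining the location prediction models corresponding to the different first tags under each topic category.
- The computer readable storage medium according to claim 19, characterized in that the topic classification step may be replaced by: separately calculating the similarity between the word vector of the target webpage and the predetermined word vector of each topic category, and, when the maximum similarity is greater than or equal to a preset similarity threshold, taking the topic category with the highest similarity as the topic category to which the target webpage belongs; and, when the maximum similarity is less than the preset similarity threshold, receiving a classification instruction for the topic category to which the target webpage belongs, and taking the topic category contained in the classification instruction as the topic category to which the target webpage belongs.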
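The two ways of identifying the topic category recited in the claims above (highest model probability, or highest similarity with a fallback to a classification instruction) can be sketched as follows. This is a hedged illustration: the claims say only "similarity", so the use of cosine similarity is an assumption, and the function names and the `ask_for_category` callback are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence


def classify_by_probability(probabilities: Dict[str, float]) -> str:
    """Pick the topic category with the highest probability in the model output."""
    return max(probabilities, key=probabilities.get)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_by_similarity(page_vector: Sequence[float],
                           topic_vectors: Dict[str, Sequence[float]],
                           threshold: float,
                           ask_for_category: Callable[[], str]) -> str:
    """Use the most similar topic if it clears the threshold, else ask for one."""
    scores = {topic: cosine_similarity(page_vector, vector)
              for topic, vector in topic_vectors.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best
    # Below the threshold: rely on an externally supplied classification instruction.
    return ask_for_category()


topic_vectors = {"travel": [1.0, 0.0], "sports": [0.0, 1.0]}
clear_case = classify_by_similarity([0.9, 0.1], topic_vectors, 0.8, lambda: "politics")
unclear_case = classify_by_similarity([1.0, 1.0], topic_vectors, 0.8, lambda: "politics")
```

In the second call both similarities are about 0.71, below the 0.8 threshold, so the category comes from the callback, which plays the role of the received classification instruction.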
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810455840.5A CN108629043B (zh) | 2018-05-14 | 2018-05-14 | Method, apparatus, and storage medium for extracting webpage target information |
CN201810455840.5 | 2018-05-14 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019218514A1 (zh) | 2019-11-21 |
Family
ID=63693220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/102115 WO2019218514A1 (zh) | Method, apparatus, and storage medium for extracting webpage target information | 2018-05-14 | 2018-08-24 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108629043B (zh) |
WO (1) | WO2019218514A1 (zh) |
-
2018
- 2018-05-14 CN CN201810455840.5A patent/CN108629043B/zh active Active
- 2018-08-24 WO PCT/CN2018/102115 patent/WO2019218514A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN108629043B (zh) | 2023-05-12 |
CN108629043A (zh) | 2018-10-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18918623 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18918623 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.03.2021) |