WO2021196468A1 - Label establishment method and apparatus, electronic device, and medium - Google Patents

Label establishment method and apparatus, electronic device, and medium

Info

Publication number
WO2021196468A1
WO2021196468A1 (PCT/CN2020/105633)
Authority
WO
WIPO (PCT)
Prior art keywords
vector
feature vector
target
feature
probability
Prior art date
Application number
PCT/CN2020/105633
Other languages
English (en)
French (fr)
Inventor
赵焕丽
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2021196468A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/04Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L63/0435Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload wherein the sending and receiving network entities apply symmetric encryption, i.e. same key used for encryption and decryption

Definitions

  • This application relates to the field of data processing technology, and in particular to a method, device, electronic device and medium for establishing a label.
  • With the development of information networks, news texts have grown explosively, and most of them are long.
  • To let readers get a general idea of a text's content before reading it, the text content is characterized when news events are reported on or commented on, and useful information is then screened out.
  • Because news text covers information from industries of all kinds, such as entertainment and technology, manually tagging news text requires familiarity with the proper nouns of each industry, which lowers the efficiency of label establishment. For this reason, automatic news-label establishment methods came into being.
  • In existing news-label establishment methods, a hidden Markov model is used to determine the entities in the text content.
  • However, when determining an entity, the hidden Markov model only considers the current word and the preceding words, without considering the influence of the following words on the current word. This is not comprehensive enough and leads to low accuracy of the established labels.
  • the first aspect of the present application provides a method for establishing a label, and the method for establishing a label includes:
  • Each first feature vector and each second feature vector are input into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, where each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector;
  • the label of the news text is determined according to the at least one probability vector.
  • a second aspect of the present application provides an electronic device including a processor and a memory, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • Each first feature vector and each second feature vector are input into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, where each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector;
  • the label of the news text is determined according to the at least one probability vector.
  • a third aspect of the present application provides a computer-readable storage medium having at least one computer-readable instruction stored thereon, and the at least one computer-readable instruction is executed by a processor to implement the following steps:
  • Each first feature vector and each second feature vector are input into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, where each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector;
  • the label of the news text is determined according to the at least one probability vector.
  • a fourth aspect of the present application provides a label establishment device, the label establishment device includes:
  • the extracting unit is configured to extract news text from the tagging instruction when the tagging instruction is received;
  • a preprocessing unit configured to preprocess the news text to obtain at least one word segmentation
  • An encoding unit configured to encode the at least one word segmentation to obtain at least one first feature vector corresponding to the at least one word segmentation
  • the extraction unit is further configured to perform context feature extraction on each first feature vector in the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
  • the input unit is configured to input each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, where each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector;
  • a processing unit configured to perform mapping processing on the at least one target feature vector to obtain at least one probability vector
  • the determining unit is configured to determine the label of the news text according to the at least one probability vector.
  • By fusing the first feature vector and the second feature vector, this application can obtain an accurate target feature vector, thereby improving the accuracy of the tag.
  • Determining the label not only makes it easy for the user to filter out news texts with certain tags, but also lets the user understand the content of a news text before reading it.
  • Fig. 1 is a flow chart of a preferred embodiment of a label establishment method disclosed in the present application.
  • Fig. 2 is a functional module diagram of a preferred embodiment of a label establishment device disclosed in the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the method for establishing a label according to the present application.
  • As shown in FIG. 1, which is a flowchart of a preferred embodiment of the label establishment method of the present application: according to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the label establishment method is applied to one or more electronic devices.
  • the electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded equipment, etc.
  • the electronic device may be any electronic product that can perform human-computer interaction with a user, such as a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol television (IPTV), a smart wearable device, etc.
  • the electronic device may also include a network device and/or user equipment.
  • the network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing.
  • the network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), etc.
  • the content in the tagging instruction may include, but is not limited to: the news text, the text number of the news text, and the like.
  • the field to which the news text belongs may include, but is not limited to: entertainment, education, technology, etc.
  • the tagging instruction can be triggered by the user (for example, triggered by a preset function button), or it can be triggered automatically within a preset time, which is not limited by this application.
  • the preset time may be a time point (for example, nine o'clock every morning), or a time period.
  • the electronic device determines a target tag from the tagging instruction; further, the electronic device extracts, from all the information carried by the tagging instruction, the information corresponding to the target tag as the news text.
  • the target tag is a tag corresponding to the news text in the tagging instruction.
  • For example, if tagging instruction A is "label 1: text number 200; label 2: University A, established in 1881, is a world-renowned university", the electronic device determines that the target label is label 2; further, the electronic device extracts from "label 1: text number 200; label 2: University A, established in 1881, is a world-renowned university" the content corresponding to label 2, "University A, established in 1881, is a world-renowned university", as the news text.
  • the at least one word segmentation refers to a word segmentation after the news text is segmented, and in addition, the at least one word segmentation may include a TOKEN tag.
  • the TOKEN tag includes, but is not limited to: time, contact number, website address, other numbers, etc.
  • the electronic device preprocessing the news text to obtain at least one word segmentation includes:
  • the electronic device filters the configuration characters in the news text to obtain a first text; further, the electronic device performs lexical analysis processing on the preset fields in the first text to obtain a second text, and segments the second text according to a preset dictionary to obtain segmentation positions; the electronic device constructs a directed acyclic graph (DAG) according to the second text and the segmentation positions; furthermore, the electronic device calculates the probability of each path in the directed acyclic graph according to the weights in the preset dictionary, determines the segmentation position corresponding to the path with the highest probability as the target segmentation position, and determines the at least one word segmentation according to the target segmentation position.
  • configuration characters include, but are not limited to: emoticons, symbol patterns, etc.
  • preset fields include, but are not limited to: time, contact information, website address, and so on.
  • the preset dictionary stores at least one user-defined word and a weight corresponding to each user-defined word, wherein the at least one user-defined word may include, but is not limited to: network new words and the like.
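  • For illustration, the sketch below (Python) shows the jieba-style computation this passage describes: build a DAG of candidate segmentations over a sentence from a preset dictionary, then pick the maximum-probability path by dynamic programming. The toy dictionary, weights, and sample sentence are hypothetical, not taken from the patent.

```python
import math

# Hypothetical preset dictionary: user-defined word -> weight (e.g., frequency).
DICT = {"成立": 8, "于": 20, "大学": 15, "世界": 12, "著名": 6, "的": 30}
TOTAL = sum(DICT.values())

def build_dag(text):
    """For each start index, list every end index that closes a dictionary word;
    a single character is always allowed as a fallback segment."""
    dag = {}
    for i in range(len(text)):
        ends = [i + 1]  # fallback: the single character itself
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in DICT:
                ends.append(j)
        dag[i] = ends
    return dag

def best_path(text, dag):
    """Right-to-left dynamic programming: route[i] holds the best
    (log-probability, end index) choice for the suffix starting at i."""
    n = len(text)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(DICT.get(text[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    return route

def segment(text):
    route = best_path(text, build_dag(text))
    i, words = 0, []
    while i < len(text):
        j = route[i][1]          # target segmentation position on the best path
        words.append(text[i:j])
        i = j
    return words

print(segment("成立于世界著名的大学"))  # ['成立', '于', '世界', '著名', '的', '大学']
```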
  • the electronic device performs lexical analysis processing on preset fields in the first text to obtain the second text including:
  • the electronic device substitutes the TOKEN tag for a preset field in the first text to obtain the second text.
  • Specifically, the electronic device uses shallow semantic analysis technology to determine the type to which each preset field belongs; further, the electronic device determines, according to the type, an identifier matching the type from the TOKEN tags, and uses the identifier to replace the preset field in the first text to obtain the second text.
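  • A minimal sketch of this replacement step, assuming regular expressions as a stand-in for the unspecified shallow semantic analysis; the TOKEN identifiers and patterns are illustrative only.

```python
import re

# Illustrative patterns only; real lexical analysis would be more thorough.
TOKEN_PATTERNS = [
    (re.compile(r"https?://[A-Za-z0-9./_-]+"), "[URL]"),    # website address
    (re.compile(r"\d{3}-\d{4}-\d{4}"), "[PHONE]"),          # contact number
    (re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"), "[TIME]"),   # time
    (re.compile(r"\d+"), "[NUM]"),                          # other numbers
]

def replace_preset_fields(first_text: str) -> str:
    """Substitute each preset field with the TOKEN tag matching its type."""
    second_text = first_text
    for pattern, token in TOKEN_PATTERNS:
        second_text = pattern.sub(token, second_text)
    return second_text

print(replace_preset_fields("2020年3月31日发布，详见https://example.com，电话138-0000-0000"))
# -> [TIME]发布，详见[URL]，电话[PHONE]
```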
  • the at least one first feature vector refers to a vector obtained by encoding the at least one word segmentation.
  • the electronic device may use one-hot encoding to encode the at least one word segmentation.
  • Specifically, the electronic device uses a binary code to represent each word segment in the at least one word segmentation, where only one bit in the binary code representing each word segment is 1 and the other bits are all 0.
  • the word segment "UK" is represented by the binary code "001"
  • the word segment "university" is represented by the binary code "010"
  • the word segment "world" is represented by the binary code "100".
  • the dimension of the first feature vector after encoding is 3.
  • “Today” is coded as 001
  • "Weather” is coded as 010
  • “Really good” is coded as 100.
  • the number of words for one-hot encoding is 5, and the dimension of the first feature vector after encoding is 5.
  • the code of "now” is 00001
  • the code of "day” is 00010
  • the code of "true” is 00100
  • the code of "clear” is 01000
  • the code of "lang” is 10000.
  • Using one-hot encoding to convert the at least one word segmentation into the at least one first feature vector not only ensures a one-to-one correspondence between the at least one word segmentation and the at least one first feature vector, but also makes the representation of the at least one word segmentation relatively intuitive.
  • In addition, converting the at least one word segmentation into the at least one first feature vector facilitates the subsequent context feature extraction for each first feature vector.
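  • The one-hot scheme in the examples above can be sketched as follows; the vocabulary-ordering convention is an assumption, so the exact bit positions may differ from the "001/010/100" example.

```python
import numpy as np

def one_hot_encode(segments):
    """Map each distinct word segment to a binary vector with exactly one 1;
    the vector dimension equals the vocabulary size, as in the examples above."""
    vocab = {word: idx for idx, word in enumerate(dict.fromkeys(segments))}
    dim = len(vocab)
    vectors = {}
    for word, idx in vocab.items():
        v = np.zeros(dim, dtype=int)
        v[idx] = 1
        vectors[word] = v
    return vectors

print(one_hot_encode(["UK", "university", "world"]))
# {'UK': array([1, 0, 0]), 'university': array([0, 1, 0]), 'world': array([0, 0, 1])}
```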
  • S13 Perform context feature extraction on each first feature vector in the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector.
  • the second feature vector represents a contextual semantic vector of the first feature vector.
  • the electronic device performing context feature extraction on each first feature vector in the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector includes:
  • the electronic device receives a configured number of vectors; for each first feature vector, the electronic device determines the context feature vector set corresponding to the first feature vector according to the at least one first feature vector and the number of vectors; further, the electronic device multiplies each feature vector in the context feature vector set by a first preset matrix and averages the multiplied vectors to obtain an intermediate vector; the electronic device multiplies the intermediate vector by a second preset matrix to obtain a target matrix, in which each column vector represents the vector corresponding to a word; furthermore, the electronic device uses an activation function to calculate the probability of each word in the target matrix, and determines the vector corresponding to the word with the highest probability as the second feature vector.
  • the number of vectors can be configured according to user requirements, and this application does not limit the value of the number of vectors.
  • the value of the first preset matrix and the value of the second preset matrix are obtained by repeatedly training the data in the corpus, and the specific training method is the prior art, which will not be repeated in this application.
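  • Read this way, the procedure resembles a CBOW-style word2vec computation. The sketch below follows that interpretation, with small random matrices standing in for the trained first and second preset matrices and softmax as the activation function; all shapes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 5, 4
W1 = rng.normal(size=(vocab_size, hidden))  # first preset matrix (trained in practice)
W2 = rng.normal(size=(hidden, vocab_size))  # second preset matrix (trained in practice)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def second_feature_vector(context_indices, embeddings):
    """context_indices: positions of the surrounding first feature vectors
    (the window size is the configured 'number of vectors')."""
    # Multiply each context vector by the first preset matrix, then average
    # the results to obtain the intermediate vector.
    intermediate = np.mean([embeddings[i] @ W1 for i in context_indices], axis=0)
    # Multiply by the second preset matrix; softmax gives each word's probability.
    probs = softmax(intermediate @ W2)
    best = int(np.argmax(probs))
    # One reading of "the vector corresponding to the word with the highest
    # probability": the column of the second preset matrix for that word.
    return W2[:, best]

embeddings = np.eye(vocab_size)  # one-hot first feature vectors
print(second_feature_vector([0, 2, 3], embeddings))
```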
  • S14 Input each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, where each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector.
  • the at least one target feature vector is obtained by performing fusion processing on the first feature vector and the second feature vector.
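  • A minimal sketch of the concatenation this passage describes. The patent does not specify how the position vector is computed from the positions, so a transformer-style sinusoidal encoding is assumed here purely for illustration.

```python
import numpy as np

def sinusoidal_position_vector(pos: int, dim: int) -> np.ndarray:
    """One plausible way to derive a vector from a position; the patent only
    says the position vector is determined by the feature vectors' positions."""
    i = np.arange(dim)
    angles = pos / np.power(10000, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def target_feature_vector(first_vec, second_vec, pos):
    # Encoding vector: concatenation of the first and second feature vectors.
    encoding = np.concatenate([first_vec, second_vec])
    # Position vector: derived from the shared position of the pair.
    position = sinusoidal_position_vector(pos, encoding.size)
    # Target feature vector: concatenation of encoding and position vectors.
    return np.concatenate([encoding, position])

t = target_feature_vector(np.array([0., 1., 0.]), np.array([0.2, -0.1, 0.4]), pos=3)
print(t.shape)  # (12,)
```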
  • Before inputting each first feature vector and each second feature vector into the pre-trained target model to obtain the target feature vector corresponding to each first feature vector, the method also includes:
  • the electronic device uses web crawler technology to obtain historical data; further, the electronic device inputs the historical data into a forgetting gate layer for forgetting processing to obtain training data, where each item of the training data includes a first input vector, a second input vector, and a known output vector; furthermore, the electronic device uses a cross-validation method to divide the training data into a training set and a verification set, and trains on the first input vectors, second input vectors, and known output vectors in the training set to obtain a learner.
  • the electronic device inputs the first input vectors and second input vectors in the verification set into the learner to obtain output vectors to be tested, and compares the output vectors to be tested with the known output vectors; further, when an output vector to be tested is inconsistent with the known output vector, the electronic device adjusts the learner according to the first input vectors, second input vectors, and known output vectors in the verification set to obtain the target model.
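  • As a rough sketch of this train/verify loop, assuming a generic scikit-learn classifier as the unspecified learner and synthetic data in place of the crawled, forgetting-processed history:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data: each sample pairs a first input
# vector and a second input vector with a known output vector (binary label).
rng = np.random.default_rng(1)
first_in = rng.normal(size=(200, 8))
second_in = rng.normal(size=(200, 8))
known_out = rng.integers(0, 2, size=200)

X = np.hstack([first_in, second_in])            # fuse the two input vectors
X_train, X_ver, y_train, y_ver = train_test_split(X, known_out, test_size=0.2)

learner = LogisticRegression(max_iter=500).fit(X_train, y_train)  # stand-in learner
to_test = learner.predict(X_ver)                # output vectors to be tested
mismatch = (to_test != y_ver).mean()            # compare with known outputs
print(f"verification mismatch rate: {mismatch:.2f}")
# When mismatches remain, the learner is adjusted on the verification data
# (e.g., via the hyperparameter search sketched below) to obtain the target model.
```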
  • the electronic device adjusting the learner according to the data in the verification set to obtain the target model includes:
  • the electronic device uses a hyperparameter grid search method to determine the optimal hyperparameter point from the data in the verification set; further, the electronic device adjusts the learner using the optimal hyperparameter point to obtain the target model.
  • Specifically, the electronic device splits the verification set according to a fixed step size to obtain a target subset, traverses the data at the two ends of the target subset, and verifies the learner with the data at the two ends to obtain the learning rate of each data point; it determines the data point with the best learning rate as the first hyperparameter point and, in the neighborhood of the first hyperparameter point, reduces the step size and continues to traverse until the step size reaches a preset step size; the hyperparameter point obtained at that point is the optimal hyperparameter point. Further, the electronic device adjusts the learner according to the optimal hyperparameter point to obtain the target model.
  • this application does not limit the preset step length.
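  • The coarse-to-fine search described above can be sketched as follows; this simplified version traverses a whole range rather than only the two ends of a subset, and the objective function is a toy stand-in for verifying the learner:

```python
def coarse_to_fine_search(evaluate, low, high, step, min_step):
    """Traverse candidate hyperparameter values over [low, high], keep the best
    one, then shrink the step and search its neighborhood, stopping once the
    step reaches the preset minimum step size."""
    best_x, best_score = None, float("-inf")
    while step >= min_step:
        x = low
        while x <= high:
            score = evaluate(x)          # e.g., verification learning rate/accuracy
            if score > best_score:
                best_x, best_score = x, score
            x += step
        # Narrow the search to the neighborhood of the current best point.
        low, high = best_x - step, best_x + step
        step /= 10
    return best_x, best_score

# Toy objective with a peak near 0.3; stands in for verifying the learner.
best, score = coarse_to_fine_search(lambda x: -(x - 0.3) ** 2, 0.0, 1.0, 0.1, 0.001)
print(best, score)
```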
  • an accurate target model can be obtained, paving the way for obtaining accurate target feature vectors.
  • It can be seen that the electronic device inputs each first feature vector and each second feature vector into the pre-trained target model, and the resulting target feature vector is obtained by fusing the first feature vector and the second feature vector; in addition, because the second feature vector represents the contextual semantic vector of the first feature vector, the target feature vector carries contextual semantics and can therefore be determined accurately.
  • For example, the phrase "all very good" can be the name of a TV series or have other meanings; without the contextual meaning, it is impossible to accurately determine whether "all very good" is a title or has another meaning.
  • S15 Perform mapping processing on the at least one target feature vector to obtain at least one probability vector.
  • the at least one probability vector refers to the probability corresponding to the at least one target feature vector, and each probability vector has N dimensions, where N is a positive integer greater than or equal to 2.
  • the sum of the probabilities of all dimensions in each probability vector is 1.
  • the electronic device performing mapping processing on the at least one target feature vector to obtain at least one probability vector includes:
  • the electronic device multiplies the at least one target feature vector by a preset weight matrix and adds a preset bias value to obtain at least one score vector; further, the electronic device normalizes the at least one score vector to obtain the at least one probability vector.
  • the value of the preset weight matrix and the value of the preset bias value are obtained through repeated training, which is not limited in this application.
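  • A minimal sketch of this mapping step: a linear transform with a preset weight matrix and bias, followed by softmax normalization so each probability vector sums to 1. The matrix values here are random placeholders for the trained ones.

```python
import numpy as np

def to_probability_vector(target_vec, weight_matrix, bias):
    """score vector = target feature vector x preset weight matrix + preset bias;
    softmax normalization makes the N dimensions sum to 1."""
    scores = target_vec @ weight_matrix + bias
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(2)
W = rng.normal(size=(12, 4))  # preset weight matrix (obtained by training)
b = rng.normal(size=4)        # preset bias value (obtained by training)
p = to_probability_vector(rng.normal(size=12), W, b)
print(p, p.sum())             # a probability vector over N=4 dimensions; sum is 1.0
```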
  • S16 Determine the label of the news text according to the at least one probability vector.
  • the electronic device determining the label of the news text according to the at least one probability vector includes:
  • the electronic device determines, from the tagging instruction, the target field to which the news text belongs (the information carried in the tagging instruction includes the target field); further, the electronic device determines, from a configuration library that stores the mapping relationships between multiple fields and multiple dictionaries, the target dictionary corresponding to the target field; for the at least one probability vector, the electronic device determines the dimension with the largest probability in each probability vector as the target dimension to obtain at least one target dimension of the at least one probability vector, and determines the category corresponding to the at least one target dimension in the target dictionary as the label of the news text.
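  • A small sketch of this lookup, with a hypothetical configuration library mapping fields to dictionaries, and dictionaries mapping probability-vector dimensions to label categories:

```python
import numpy as np

# Hypothetical configuration library: field -> dictionary, where the dictionary
# maps each probability-vector dimension to a label category.
CONFIG_LIBRARY = {
    "entertainment": {0: "celebrity", 1: "film", 2: "music", 3: "TV series"},
    "technology":    {0: "AI", 1: "hardware", 2: "internet", 3: "startup"},
}

def determine_labels(probability_vectors, target_field):
    target_dict = CONFIG_LIBRARY[target_field]
    labels = []
    for p in probability_vectors:
        target_dim = int(np.argmax(p))          # dimension with the largest probability
        labels.append(target_dict[target_dim])  # its category becomes the label
    return labels

print(determine_labels([np.array([0.1, 0.2, 0.1, 0.6])], "entertainment"))
# -> ['TV series']
```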
  • the method further includes:
  • the electronic device obtains the text number of the news text from the tagging instruction; further, the electronic device generates prompt information according to the text number and the label, uses symmetric encryption technology to encrypt the prompt information to obtain a cipher text, and then sends the cipher text to the terminal device of a designated contact.
  • the prompt information can be quickly encrypted, avoiding tampering of the label of the news text, and improving the security of the prompt information.
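  • The patent names only "symmetric encryption technology"; the sketch below uses Fernet from the Python cryptography package (AES-128-CBC with an HMAC) as one concrete choice, with a hypothetical transport step left as a comment:

```python
from cryptography.fernet import Fernet

# Shared secret between the electronic device and the designated contact.
key = Fernet.generate_key()
fernet = Fernet(key)

prompt_info = "text number: 200; label: world-renowned university"
cipher_text = fernet.encrypt(prompt_info.encode("utf-8"))
# send_to_designated_contact(cipher_text)  # hypothetical transport step

# The contact decrypts with the same key, and any tampering with the cipher
# text causes decryption to fail, protecting the label of the news text.
assert fernet.decrypt(cipher_text).decode("utf-8") == prompt_info
```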
  • By fusing the first feature vector and the second feature vector, this application can obtain an accurate target feature vector, thereby improving the accuracy of the tag.
  • Determining the label not only makes it easy for the user to filter out news texts with certain tags, but also lets the user understand the content of a news text before reading it.
  • the label establishment device 11 includes an extraction unit 110, a preprocessing unit 111, an encoding unit 112, an input unit 113, a processing unit 114, a determination unit 115, an acquisition unit 116, a division unit 117, an adjustment unit 118, a generation unit 119, an encryption unit 120, a sending unit 121, and a training unit 122.
  • the module/unit referred to in this application refers to a series of computer program segments that can be executed by the processor 13 and can complete fixed functions, and are stored in the memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
  • When a tagging instruction is received, the extracting unit 110 extracts the news text from the tagging instruction.
  • the content in the tagging instruction may include, but is not limited to: the news text, the text number of the news text, and the like.
  • the field to which the news text belongs may include, but is not limited to: entertainment, education, technology, etc.
  • the tagging instruction can be triggered by the user (for example, triggered by a preset function button), or it can be triggered automatically within a preset time, which is not limited by this application.
  • the preset time may be a time point (for example, nine o'clock every morning), or a time period.
  • the extracting unit 110 determines the target tag from the tagging instruction; further, the extracting unit 110 extracts, from all the information carried by the tagging instruction, the information corresponding to the target tag as the news text.
  • the target tag is a tag corresponding to the news text in the tagging instruction.
  • For example, if tagging instruction A is "label 1: text number 200; label 2: University A, established in 1881, is a world-renowned university", the extraction unit 110 determines that the target label is label 2; further, the extraction unit 110 extracts from the instruction the content corresponding to label 2, "University A, established in 1881, is a world-renowned university", as the news text.
  • the preprocessing unit 111 preprocesses the news text to obtain at least one word segmentation.
  • the at least one word segmentation refers to a word segmentation after the news text is segmented, and in addition, the at least one word segmentation may include a TOKEN tag.
  • the TOKEN tag includes, but is not limited to: time, contact number, website address, other numbers, etc.
  • the preprocessing unit 111 preprocessing the news text to obtain at least one word segmentation includes:
  • the preprocessing unit 111 filters the configuration characters in the news text to obtain a first text; further, the preprocessing unit 111 performs lexical analysis processing on the preset fields in the first text to obtain a second text, and segments the second text according to a preset dictionary to obtain segmentation positions; the preprocessing unit 111 constructs a directed acyclic graph (DAG) according to the second text and the segmentation positions.
  • Furthermore, the preprocessing unit 111 calculates the probability of each path in the directed acyclic graph according to the weights in the preset dictionary, determines the segmentation position corresponding to the path with the highest probability as the target segmentation position, and determines the at least one word segmentation according to the target segmentation position.
  • configuration characters include, but are not limited to: emoticons, symbol patterns, etc.
  • preset fields include, but are not limited to: time, contact information, website address, and so on.
  • the preset dictionary stores at least one user-defined word and a weight corresponding to each user-defined word, wherein the at least one user-defined word may include, but is not limited to: network new words and the like.
  • the preprocessing unit 111 performs lexical analysis processing on preset fields in the first text to obtain the second text including:
  • the preprocessing unit 111 substitutes the TOKEN tag for a preset field in the first text to obtain the second text.
  • Specifically, the preprocessing unit 111 uses shallow semantic analysis technology to determine the type to which each preset field belongs; further, the preprocessing unit 111 determines, according to the type, an identifier matching the type from the TOKEN tags, and uses the identifier to replace the preset field in the first text to obtain the second text.
  • the encoding unit 112 encodes the at least one word segmentation to obtain at least one first feature vector corresponding to the at least one word segmentation.
  • the at least one first feature vector refers to a vector obtained by encoding the at least one word segmentation.
  • the encoding unit 112 may use one-hot encoding to encode the at least one word segmentation.
  • the encoding unit 112 adopts binary encoding to represent each word segment in the at least one word segmentation, where only one bit in the binary code representing each word segmentation is 1 and the other bits are all 0.
  • the word segment "UK" is represented by the binary code "001"
  • the word segment "university" is represented by the binary code "010"
  • the word segment "world" is represented by the binary code "100".
  • the dimension of the first feature vector after encoding is 3.
  • “Today” is coded as 001
  • "Weather” is coded as 010
  • “Really good” is coded as 100.
  • the number of words for one-hot encoding is 5, and the dimension of the first feature vector after encoding is 5.
  • the code of "now” is 00001
  • the code of "day” is 00010
  • the code of "true” is 00100
  • the code of "clear” is 01000
  • the code of "lang” is 10000.
  • Using one-hot encoding to convert the at least one word segmentation into the at least one first feature vector not only ensures a one-to-one correspondence between the at least one word segmentation and the at least one first feature vector, but also makes the representation of the at least one word segmentation relatively intuitive.
  • In addition, converting the at least one word segmentation into the at least one first feature vector facilitates the subsequent context feature extraction for each first feature vector.
  • the extraction unit 110 performs context feature extraction on each first feature vector in the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector.
  • the second feature vector represents a contextual semantic vector of the first feature vector.
  • the extraction unit 110 performing context feature extraction on each first feature vector in the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector includes:
  • the extraction unit 110 receives a configured number of vectors; for each first feature vector, the extraction unit 110 determines the context feature vector set corresponding to the first feature vector according to the at least one first feature vector and the number of vectors; further, the extraction unit 110 multiplies each feature vector in the context feature vector set by a first preset matrix and averages the multiplied vectors to obtain an intermediate vector; the extraction unit 110 multiplies the intermediate vector by a second preset matrix to obtain a target matrix, in which each column vector represents the vector corresponding to a word; furthermore, the extraction unit 110 uses an activation function to calculate the probability of each word in the target matrix, and determines the vector corresponding to the word with the highest probability as the second feature vector.
  • the number of vectors can be configured according to user requirements, and this application does not limit the value of the number of vectors.
  • the value of the first preset matrix and the value of the second preset matrix are obtained by repeatedly training the data in the corpus, and the specific training method is the prior art, which will not be repeated in this application.
  • the input unit 113 inputs each first feature vector and each second feature vector into the pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, where each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector.
  • the at least one target feature vector is obtained by performing fusion processing on the first feature vector and the second feature vector.
  • the obtaining unit 116 uses web crawler technology to obtain historical data; further, the processing unit 114 inputs the historical data into a forgetting gate layer for forgetting processing to obtain training data, where each item of the training data includes a first input vector, a second input vector, and a known output vector; furthermore, the dividing unit 117 uses a cross-validation method to divide the training data into a training set and a verification set, and the training unit 122 trains on the first input vectors, second input vectors, and known output vectors in the training set to obtain a learner.
  • the input unit 113 inputs the first input vectors and second input vectors in the verification set into the learner to obtain output vectors to be tested, and compares the output vectors to be tested with the known output vectors; further, when an output vector to be tested is inconsistent with the known output vector, the adjustment unit 118 adjusts the learner according to the first input vectors, second input vectors, and known output vectors in the verification set to obtain the target model.
  • the adjustment unit 118 adjusting the learner according to the data in the verification set to obtain the target model includes:
  • the adjustment unit 118 uses a hyperparameter grid search method to determine the optimal hyperparameter point from the data in the verification set; further, the adjustment unit 118 adjusts the learner using the optimal hyperparameter point to obtain the target model.
  • Specifically, the adjustment unit 118 splits the verification set according to a fixed step size to obtain a target subset, traverses the data at the two ends of the target subset, and verifies the learner with the data at the two ends to obtain the learning rate of each data point; it determines the data point with the best learning rate as the first hyperparameter point and, in the neighborhood of the first hyperparameter point, reduces the step size and continues to traverse until the step size reaches the preset step size; the hyperparameter point obtained at that point is the optimal hyperparameter point.
  • The adjustment unit 118 then adjusts the learner according to the optimal hyperparameter point to obtain the target model.
  • this application does not limit the preset step length.
  • an accurate target model can be obtained, paving the way for obtaining accurate target feature vectors.
  • It can be seen that the input unit 113 inputs each first feature vector and each second feature vector into a pre-trained target model, and the resulting target feature vector is obtained by fusing the first feature vector and the second feature vector.
  • In addition, because the second feature vector represents the contextual semantic vector of the first feature vector, the target feature vector carries contextual semantics and can therefore be determined accurately.
  • For example, the phrase "all very good" can be the name of a TV series or have other meanings; without the contextual meaning, it is impossible to accurately determine whether "all very good" is a title or has another meaning.
  • the processing unit 114 performs mapping processing on the at least one target feature vector to obtain at least one probability vector.
  • the at least one probability vector refers to the probability corresponding to the at least one target feature vector, and each probability vector has N dimensions, where N is a positive integer greater than or equal to 2.
  • the sum of the probabilities of all dimensions in each probability vector is 1.
  • the processing unit 114 performing mapping processing on the at least one target feature vector to obtain at least one probability vector includes:
  • the processing unit 114 multiplies the at least one target feature vector by a preset weight matrix and adds a preset bias value to obtain at least one score vector; further, the processing unit 114 normalizes the at least one score vector to obtain the at least one probability vector.
  • the value of the preset weight matrix and the value of the preset bias value are obtained through repeated training, which is not limited in this application.
  • the determining unit 115 determines the label of the news text according to the at least one probability vector.
  • the determining unit 115 determining the label of the news text according to the at least one probability vector includes:
  • the determining unit 115 determines, from the tagging instruction, the target field to which the news text belongs (the information carried in the tagging instruction includes the target field); further, the determining unit 115 determines, from a configuration library that stores the mapping relationships between multiple fields and multiple dictionaries, the target dictionary corresponding to the target field; for the at least one probability vector, the determining unit 115 determines the dimension with the largest probability in each probability vector as the target dimension to obtain at least one target dimension of the at least one probability vector, and determines the category corresponding to the at least one target dimension in the target dictionary as the label of the news text.
  • the obtaining unit 116 obtains the text number of the news text from the tagging instruction; further, the generation unit 119 generates prompt information according to the text number and the label; the encryption unit 120 uses symmetric encryption technology to encrypt the prompt information to obtain a cipher text; and the sending unit 121 sends the cipher text to the terminal device of the designated contact.
  • the prompt information can be quickly encrypted, avoiding tampering of the label of the news text, and improving the security of the prompt information.
  • By fusing the first feature vector and the second feature vector, the present application can obtain an accurate target feature vector, thereby improving the accuracy of the tag.
  • Determining the label not only makes it easy for the user to filter out news texts with certain tags, but also lets the user understand the content of a news text before reading it.
  • FIG. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the method for establishing a label of the present application.
  • the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and runnable on the processor 13, such as a label establishment program.
  • the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1.
  • the electronic device 1 may also include an input/output device, a network access device, a bus, and the like.
  • the processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the processor 13 is the computing core and control center of the electronic device 1; it connects all parts of the entire electronic device 1 with various interfaces and lines, and executes the operating system of the electronic device 1 as well as various installed applications, program codes, etc.
  • the processor 13 executes the application program to implement the steps in the foregoing label establishment method embodiments, for example, the steps shown in FIG. 1.
  • the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to complete this application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program in the electronic device 1.
  • the computer program can be divided into an extraction unit 110, a preprocessing unit 111, an encoding unit 112, an input unit 113, a processing unit 114, a determination unit 115, an acquisition unit 116, a division unit 117, an adjustment unit 118, a generation unit 119, an encryption unit 120, a sending unit 121, and a training unit 122.
  • the memory 12 may be used to store the computer program and/or module.
  • the processor 13 runs or executes the computer program and/or module stored in the memory 12 and calls data stored in the memory 12, The various functions of the electronic device 1 are realized.
  • the memory 12 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), etc.; the storage data area may store data created based on the use of the electronic device, etc.
  • the memory 12 may include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the computer-readable storage medium stores computer-readable instructions, and the processor executes the computer-readable instructions to implement the following steps: when a tagging instruction is received, extracting news text from the tagging instruction; preprocessing the news text to obtain at least one word segmentation; encoding the at least one word segmentation to obtain at least one first feature vector corresponding to the at least one word segmentation; performing context feature extraction on each first feature vector in the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector; inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, where each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector; performing mapping processing on the at least one target feature vector to obtain at least one probability vector; and determining the label of the news text according to the at least one probability vector.
  • the at least one computer-readable instruction is executed by the processor to implement the following steps: filtering the configuration characters in the news text to obtain a first text; performing lexical analysis processing on the preset fields in the first text to obtain a second text; segmenting the second text according to a preset dictionary to obtain segmentation positions; constructing a directed acyclic graph according to the second text and the segmentation positions; calculating the probability of each path in the directed acyclic graph according to the weights in the preset dictionary; determining the segmentation position corresponding to the path with the highest probability as the target segmentation position; and determining the at least one word segmentation according to the target segmentation position.
  • the at least one computer-readable instruction is executed by the processor to implement the following steps: receiving a configured number of vectors; for each first feature vector, determining the context feature vector set corresponding to the first feature vector according to the at least one first feature vector and the number of vectors; multiplying each feature vector in the context feature vector set by a first preset matrix and averaging the multiplied vectors to obtain an intermediate vector; multiplying the intermediate vector by a second preset matrix to obtain a target matrix, in which each column vector represents the vector corresponding to a word; using an activation function to calculate the probability of each word in the target matrix; and determining the vector corresponding to the word with the highest probability as the second feature vector.
  • the at least one computer-readable instruction is also executed by the processor to implement the following steps: using web crawler technology to obtain historical data; inputting the historical data into a forgetting gate layer for forgetting processing to obtain training data, where each item of the training data includes a first input vector, a second input vector, and a known output vector; dividing the training data into a training set and a verification set using a cross-validation method; training on the first input vectors, second input vectors, and known output vectors in the training set to obtain a learner; inputting the first input vectors and second input vectors in the verification set into the learner to obtain output vectors to be tested, and comparing the output vectors to be tested with the known output vectors; and, when an output vector to be tested is inconsistent with the known output vector, adjusting the learner according to the first input vectors, second input vectors, and known output vectors in the verification set to obtain the target model.
  • the at least one computer-readable instruction is further executed by the processor to implement the following steps: multiplying the at least one target feature vector by a preset weight matrix and adding a preset bias value to obtain at least one score vector; and performing normalization processing on the at least one score vector to obtain the at least one probability vector.
  • the at least one computer-readable instruction is executed by the processor to implement the following steps: determining, from the tagging instruction, the target field to which the news text belongs, where the information carried in the tagging instruction includes the target field; determining, from a configuration library that stores the mapping relationships between multiple fields and multiple dictionaries, the target dictionary corresponding to the target field; for the at least one probability vector, determining the dimension with the largest probability in each probability vector as the target dimension to obtain at least one target dimension of the at least one probability vector; and determining the category corresponding to the at least one target dimension in the target dictionary as the label of the news text.
  • the memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in a physical form, such as a memory stick, a TF card (Trans-flash Card), and so on.
  • the integrated module/unit of the electronic device 1 may be stored in a computer-readable storage medium, which may be a non-volatile storage medium or a volatile storage medium.
  • the computer program includes computer program code
  • the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), etc.
  • the memory 12 in the electronic device 1 stores multiple instructions to implement a label establishment method, and the processor 13 can execute the multiple instructions to realize: when a tagging instruction is received, extracting news text from the tagging instruction; preprocessing the news text to obtain at least one word segmentation; encoding the at least one word segmentation to obtain at least one first feature vector corresponding to the at least one word segmentation; performing context feature extraction on each first feature vector in the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector; and inputting each first feature vector and each second feature vector into the pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector.
  • Each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector; the at least one target feature vector is mapped to obtain at least one probability vector; and the label of the news text is determined according to the at least one probability vector.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks associated with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A label establishment method and apparatus, an electronic device, and a medium. The method includes: when a tagging instruction is received, extracting news text from the tagging instruction (S10); preprocessing the news text to obtain at least one word segmentation (S11); encoding the at least one word segmentation to obtain at least one first feature vector corresponding to the at least one word segmentation (S12); performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector (S13); inputting each first feature vector and each second feature vector into a pre-trained target model to obtain target feature vectors; performing mapping processing on the target feature vectors to obtain probability vectors; and determining the label of the news text. By fusing the first feature vectors and the second feature vectors, an accurate target feature vector can be obtained, which improves the accuracy of the label. In addition, determining the label not only makes it easy for users to filter out news texts with certain labels, but also allows users to understand the content of a news text before reading it.

Description

Label establishment method and apparatus, electronic device, and medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 31, 2020, with application number 202010243203.9 and the invention title "Label establishment method and apparatus, electronic device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of data processing technology, and in particular to a label establishment method and apparatus, an electronic device, and a medium.
Background
With the development of information networks, news texts have grown explosively, and most of them are long. To let readers get a general idea of a text's content before reading it, the text content is characterized when news events are reported on or commented on, and useful information is then screened out. Because news text covers information from industries of all kinds, such as entertainment and technology, manually tagging news text requires familiarity with the proper nouns of each industry, which lowers the efficiency of label establishment. For this reason, news-label establishment methods came into being.
The inventor realized that in existing news-label establishment methods, a hidden Markov model is used to determine the entities in the text content. However, when determining an entity, the hidden Markov model only considers the current word and the preceding words, without considering the influence of the following words on the current word. This is not comprehensive enough and leads to low accuracy of the established labels.
Therefore, how to construct an accurate news-label establishment scheme has become a technical problem to be solved.
Summary
In view of the above, it is necessary to provide a label establishment method and apparatus, an electronic device, and a medium that can improve the accuracy of labels.
A first aspect of this application provides a label establishment method, the label establishment method including:
when a tagging instruction is received, extracting news text from the tagging instruction;
preprocessing the news text to obtain at least one word segmentation;
encoding the at least one word segmentation to obtain at least one first feature vector corresponding to the at least one word segmentation;
performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, where each target feature vector is formed by concatenating an encoding vector and a position vector, each encoding vector is formed by concatenating each first feature vector and each second feature vector, and each position vector is determined according to the position of each first feature vector and the position of each second feature vector;
performing mapping processing on the at least one target feature vector to obtain at least one probability vector;
determining the label of the news text according to the at least one probability vector.
A second aspect of this application provides an electronic device, the electronic device comprising a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:
when a tagging instruction is received, extracting a news text from the tagging instruction;
preprocessing the news text to obtain at least one word segment;
encoding the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment;
performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector;
performing mapping processing on the at least one target feature vector to obtain at least one probability vector; and
determining a label of the news text according to the at least one probability vector.
A third aspect of this application provides a computer-readable storage medium having at least one computer-readable instruction stored thereon, the at least one computer-readable instruction being executed by a processor to implement the following steps:
when a tagging instruction is received, extracting a news text from the tagging instruction;
preprocessing the news text to obtain at least one word segment;
encoding the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment;
performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector;
performing mapping processing on the at least one target feature vector to obtain at least one probability vector; and
determining a label of the news text according to the at least one probability vector.
A fourth aspect of this application provides a label establishment apparatus, the label establishment apparatus comprising:
an extraction unit, configured to extract a news text from a tagging instruction when the tagging instruction is received;
a preprocessing unit, configured to preprocess the news text to obtain at least one word segment;
an encoding unit, configured to encode the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment;
the extraction unit being further configured to perform context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
an input unit, configured to input each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector;
a processing unit, configured to perform mapping processing on the at least one target feature vector to obtain at least one probability vector; and
a determination unit, configured to determine a label of the news text according to the at least one probability vector.
It can be seen from the above technical solutions that, by fusing the first feature vectors and the second feature vectors, this application can obtain accurate target feature vectors, thereby improving the accuracy of the labels. In addition, determining the labels not only makes it easy for users to filter out news texts carrying certain labels, but also allows users to understand the content of a news text before reading it.
Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the label establishment method disclosed in this application.
FIG. 2 is a functional module diagram of a preferred embodiment of the label establishment apparatus disclosed in this application.
FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the label establishment method of this application.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of this application clearer, this application is described in detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart of a preferred embodiment of the label establishment method of this application. According to different requirements, the order of the steps in the flowchart may be changed, and some steps may be omitted.
The label establishment method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of human-computer interaction with a user, for example, a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet protocol television (IPTV), a smart wearable device, and the like.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing.
The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
S10: when a tagging instruction is received, extract a news text from the tagging instruction.
In at least one embodiment of this application, the content of the tagging instruction may include, but is not limited to: the news text, a text number of the news text, and the like. The field to which the news text belongs may include, but is not limited to: entertainment, education, technology, and the like.
In at least one embodiment of this application, the tagging instruction may be triggered by a user (for example, via a preset function key), or may be triggered automatically at a preset time; this application imposes no limitation.
The preset time may be a point in time (for example, 9 a.m. every day) or a time period.
In at least one embodiment of this application, the electronic device determines a target tag from the tagging instruction, and further extracts, from all the information carried in the tagging instruction, the information corresponding to the target tag as the news text.
The target tag is the tag that corresponds to the news text within the tagging instruction.
For example, tagging instruction A is "标签1:文本编号200;标签2:成立于1881年的甲大学是世界著名的大学" ("Tag 1: text number 200; Tag 2: University A, founded in 1881, is a world-famous university"). The electronic device determines that the target tag is Tag 2 (标签2), and further extracts the text corresponding to Tag 2, "成立于1881年的甲大学是世界著名的大学", as the news text.
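As an illustration of this extraction step, the following is a minimal sketch that parses an instruction string shaped like tagging instruction A above; the "tag:value;tag:value" delimiter format is an assumption made only for this example, not a format defined by this application.

```python
def extract_news_text(instruction: str, target_tag: str = "标签2") -> str:
    """Split the instruction into tag/value pairs and return the value of the target tag."""
    pairs = (part.split(":", 1) for part in instruction.split(";"))
    fields = {tag.strip(): value.strip() for tag, value in pairs}
    return fields[target_tag]

instruction_a = "标签1:文本编号200;标签2:成立于1881年的甲大学是世界著名的大学"
print(extract_news_text(instruction_a))  # -> 成立于1881年的甲大学是世界著名的大学
```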
S11: preprocess the news text to obtain at least one word segment.
In at least one embodiment of this application, the at least one word segment refers to the segments obtained by splitting the news text; in addition, the at least one word segment may include TOKEN tags.
The TOKEN tags include, but are not limited to: time, contact phone number, URL, other numbers, and the like.
在本申请的至少一个实施例中,所述电子设备对所述新闻文本进行预处理,得到至少一个分词包括:
所述电子设备过滤所述新闻文本中的配置字符,得到第一文本,进一步地,所述电子设备对所述第一文本中的预设字段进行词法分析处理,得到第二文本,根据预设词典对所述第二文本进行切分,得到切分位置,所述电子设备根据所述第二文本及所述切分位置,构建有向无环图(Directed acyclic graph,DAG),更进一步地,所述电子设备根据所述预设词典中 的权值计算所述有向无环图中每条路径的概率,将概率最大的路径对应的切分位置确定为目标切分位置,所述电子设备根据所述目标切分位置确定所述至少一个分词。
其中,所述配置字符包括,但不限于:表情符号、符号图案等。
进一步地,所述预设字段包括,但不限于:时间、联系方式、网址等。
更进一步地,所述预设词典中存储至少一个自定义词及每个自定义词对应的权值,其中,所述至少一个自定义词可以包括,但不限于:网络新词等。
通过过滤所述配置字符,不仅能够节省所述电子设备的内存,还能节省处理所述新闻文本的时间,进而提高打标签的效率;通过对所述预设字段进行词法分析处理,能够避免后续提取上下文特征时发生不必要的扰动;通过具有权值的预设词典切分所述第二文本,能够准确地确定所述至少一个分词。
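As a concrete illustration of the dictionary-weighted DAG segmentation just described, here is a minimal sketch in the style of common Chinese segmenters; the dictionary contents and weights are illustrative assumptions, not values from this application.

```python
import math

# Hypothetical preset dictionary: custom word -> weight (e.g. a corpus frequency).
PRESET_DICT = {"甲大学": 80, "大学": 200, "世界": 300, "著名": 150,
               "世界著名": 60, "的": 10000, "是": 9000}
TOTAL = sum(PRESET_DICT.values())

def build_dag(text):
    """For each start position, list every end position where a dictionary word ends."""
    dag = {}
    for i in range(len(text)):
        ends = [i]  # a single character is always a fallback segment
        for j in range(i + 1, len(text)):
            if text[i:j + 1] in PRESET_DICT:
                ends.append(j)
        dag[i] = ends
    return dag

def best_path(text, dag):
    """Dynamic programming over the DAG: at each position, pick the split that
    maximizes the total log-probability derived from the dictionary weights."""
    n = len(text)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(PRESET_DICT.get(text[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    return route

def segment(text):
    route = best_path(text, build_dag(text))
    i, words = 0, []
    while i < len(text):
        j = route[i][1] + 1  # target split position on the most probable path
        words.append(text[i:j])
        i = j
    return words

print(segment("甲大学是世界著名的大学"))  # e.g. ['甲大学', '是', '世界著名', '的', '大学']
```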
In at least one embodiment of this application, the electronic device performing lexical analysis on the preset fields in the first text to obtain the second text includes:
the electronic device replaces the preset fields in the first text with the TOKEN tags to obtain the second text.
Specifically, the electronic device uses shallow semantic analysis to determine the type to which a preset field belongs; further, the electronic device determines, from the TOKEN tags, an identifier matching that type, and replaces the preset field in the first text with the identifier to obtain the second text.
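A sketch of this TOKEN replacement step, approximating the shallow semantic analysis with regular expressions; the patterns and tag identifiers are assumptions made for illustration.

```python
import re

# Hypothetical patterns per TOKEN type; insertion order matters, <NUM> acting as a catch-all.
TOKEN_PATTERNS = {
    "<TIME>":  re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),
    "<URL>":   re.compile(r"https?://[\w.\-/]+"),
    "<PHONE>": re.compile(r"1\d{10}"),
    "<NUM>":   re.compile(r"\d+"),
}

def to_second_text(first_text: str) -> str:
    """Replace each preset field with the identifier matching its type."""
    for token, pattern in TOKEN_PATTERNS.items():
        first_text = pattern.sub(token, first_text)
    return first_text

print(to_second_text("2020年3月31日发布，详见https://example.com，电话13800000000"))
# -> <TIME>发布，详见<URL>，电话<PHONE>
```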
S12: encode the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment.
In at least one embodiment of this application, the at least one first feature vector refers to the vectors obtained by encoding the at least one word segment.
In at least one embodiment of this application, the electronic device may use one-hot encoding to encode the at least one word segment.
Specifically, the electronic device represents each of the at least one word segment with a binary code, where exactly one bit of the binary code representing each word segment is 1 and all other bits are 0.
For example, the segment "英国" (UK) is represented by the binary code "001", the segment "大学" (university) by "010", and the segment "世界" (world) by "100".
Naturally, the more words are one-hot encoded, the higher the dimensionality of the at least one first feature vector.
For example, if 3 words are one-hot encoded, the resulting first feature vectors have 3 dimensions, e.g. "今天" (today) encoded as 001, "天气" (weather) as 010, and "真好" (really nice) as 100; if 5 words are one-hot encoded, the resulting first feature vectors have 5 dimensions, e.g. "今" as 00001, "天" as 00010, "真" as 00100, "晴" as 01000, and "朗" as 10000.
Through the above implementation, converting the at least one word segment into the at least one first feature vector with one-hot encoding not only guarantees a one-to-one correspondence between the word segments and the first feature vectors, but also represents the word segments intuitively; moreover, the conversion facilitates the subsequent context feature extraction on each first feature vector.
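A minimal sketch of this one-hot encoding step, using the example segments above; the vocabulary is built directly from the input.

```python
import numpy as np

def one_hot_encode(segments):
    """Assign each distinct segment an index, then emit one 0/1 vector per segment."""
    vocab = {word: idx for idx, word in enumerate(dict.fromkeys(segments))}
    dim = len(vocab)  # more encoded words -> higher-dimensional first feature vectors
    vectors = np.zeros((len(segments), dim), dtype=np.int8)
    for row, word in enumerate(segments):
        vectors[row, vocab[word]] = 1  # exactly one bit set per segment
    return vocab, vectors

vocab, first_feature_vectors = one_hot_encode(["今天", "天气", "真好"])
print(vocab)                  # {'今天': 0, '天气': 1, '真好': 2}
print(first_feature_vectors)  # identity-like 3x3 matrix, one 1 per row
```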
S13: perform context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector.
In at least one embodiment of this application, the second feature vector characterizes the context semantic vector of the first feature vector.
In at least one embodiment of this application, the electronic device performing context feature extraction on each first feature vector among the at least one first feature vector to obtain the second feature vector corresponding to each first feature vector includes:
the electronic device receives a configured number of vectors; for each first feature vector, the electronic device determines, according to the at least one first feature vector and the number of vectors, a context feature vector set corresponding to that first feature vector; further, the electronic device multiplies each feature vector in the context feature vector set by a first preset matrix and calculates the average of the multiplied vectors to obtain an intermediate vector; the electronic device multiplies the intermediate vector by a second preset matrix to obtain a target matrix, where each column vector of the target matrix characterizes the vector corresponding to a word; still further, the electronic device calculates the probability of each word in the target matrix with an activation function, and determines the vector corresponding to the word with the highest probability as the second feature vector.
The number of vectors may be configured according to user needs; this application imposes no limitation on its value. The values of the first preset matrix and of the second preset matrix are obtained by repeated training on data in a corpus; the specific training method is prior art and is not described here.
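The procedure of S13 resembles a CBOW-style forward pass; the sketch below assumes small random matrices in place of the corpus-trained first and second preset matrices, and uses softmax as the activation function.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 4                    # vocabulary size, hidden size (illustrative)
W1 = rng.normal(size=(V, H))   # "first preset matrix" (assumed trained on a corpus)
W2 = rng.normal(size=(H, V))   # "second preset matrix"; column k characterizes word k

def second_feature_vector(first_vectors, center, window=1):
    """Context feature extraction for the first feature vector at index `center`."""
    # 1. collect the context feature vector set (configured number of neighbors per side)
    idx = [i for i in range(max(0, center - window),
                            min(len(first_vectors), center + window + 1)) if i != center]
    context = np.stack([first_vectors[i] for i in idx])
    # 2. multiply each context vector by W1 and average -> intermediate vector
    intermediate = (context @ W1).mean(axis=0)
    # 3. dot with W2 -> target matrix scores, one per word column
    scores = intermediate @ W2
    # 4. activation function (softmax) -> per-word probabilities
    probs = np.exp(scores - scores.max()); probs /= probs.sum()
    # 5. the column of the highest-probability word is the second feature vector
    return W2[:, probs.argmax()]

first_vectors = np.eye(V)  # one-hot first feature vectors
print(second_feature_vector(first_vectors, center=2))
```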
S14: input each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector.
In at least one embodiment of this application, the at least one target feature vector is obtained by fusing the first feature vectors and the second feature vectors.
In at least one embodiment of this application, before each first feature vector and each second feature vector are input into the pre-trained target model to obtain the target feature vector corresponding to each first feature vector, the method further includes:
the electronic device acquires historical data with web crawler technology; further, the electronic device inputs the historical data into a forget gate layer for forgetting processing to obtain training data, where each item of the training data includes a first input vector, a second input vector and a known output vector; still further, the electronic device divides the training data into a training set and a validation set by cross-validation, and trains a learner based on the first input vectors, second input vectors and known output vectors in the training set; the electronic device inputs the first input vectors and second input vectors of the validation set into the learner to obtain output vectors under test, and compares the output vectors under test with the known output vectors; still further, when the output vectors under test are inconsistent with the known output vectors, the electronic device adjusts the learner according to the first input vectors, second input vectors and known output vectors in the validation set, to obtain the target model.
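A compressed sketch of this training flow, with the forget-gate layer abstracted into a recency filter, a scikit-learn logistic regression standing in for the learner, and the known output vectors simplified to class labels; all of these stand-ins are assumptions for illustration, not the application's own model.

```python
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def forget_gate_filter(history, keep_ratio=0.8):
    """Stand-in for the forget-gate layer: drop the oldest records, keep the rest."""
    return history[int(len(history) * (1 - keep_ratio)):]

def train_target_model(history, split=0.8, seed=0):
    data = forget_gate_filter(history)
    random.Random(seed).shuffle(data)
    k = int(len(data) * split)
    train, valid = data[:k], data[k:]  # one fold of a cross-validation split
    as_x = lambda rows: np.array([np.concatenate([r["first"], r["second"]]) for r in rows])
    as_y = lambda rows: np.array([r["known_output"] for r in rows])
    learner = LogisticRegression(max_iter=1000).fit(as_x(train), as_y(train))
    predicted = learner.predict(as_x(valid))        # output under test
    if not np.array_equal(predicted, as_y(valid)):  # mismatch: adjust on validation data
        learner.fit(np.vstack([as_x(train), as_x(valid)]),
                    np.concatenate([as_y(train), as_y(valid)]))
    return learner  # the adjusted learner serves as the target model
```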
In at least one embodiment of this application, the electronic device adjusting the learner according to the data in the validation set to obtain the target model includes:
the electronic device uses a hyperparameter grid search method to determine an optimal hyperparameter point from the data in the validation set; further, the electronic device adjusts the learner with the optimal hyperparameter point to obtain the target model.
Specifically, the electronic device splits the validation set by a fixed step size to obtain target subsets, traverses the data at the two endpoints of each target subset, validates the learner with the endpoint data to obtain the learning rate of each data point, and determines the data with the best learning rate as a first hyperparameter point; then, within the neighborhood of the first hyperparameter point, it shrinks the step size and continues the traversal until the step size equals a preset step size, at which point the hyperparameter point obtained is the optimal hyperparameter point. Still further, the electronic device adjusts the learner according to the optimal hyperparameter point to obtain the target model.
This application imposes no limitation on the preset step size.
By adjusting the learner, an accurate target model can be obtained, laying the groundwork for obtaining accurate target feature vectors.
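A sketch of the shrinking grid search described above, applied to a single hyperparameter; the score function and search bounds are illustrative placeholders for validation performance.

```python
def frange(low, high, step):
    x = low
    while x <= high:
        yield x
        x += step

def shrinking_grid_search(score, low, high, step, min_step):
    """Evaluate points a fixed step apart, keep the best, then shrink the step
    inside its neighborhood until the preset minimum step is reached."""
    best = max(frange(low, high, step), key=score)
    while step > min_step:
        step /= 2.0  # shrink the traversal step around the current best point
        candidates = [max(low, best - step), best, min(high, best + step)]
        best = max(candidates, key=score)
    return best

# Illustrative usage: tune a learning rate against a made-up validation score.
score = lambda lr: -(lr - 0.03) ** 2  # peaks at 0.03 (placeholder for accuracy)
print(shrinking_grid_search(score, 0.001, 0.1, 0.02, 0.0005))  # converges near 0.03
```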
In at least one embodiment of this application, the target feature vector obtained by the electronic device inputting each first feature vector and each second feature vector into the pre-trained target model is the result of fusing the first feature vector and the second feature vector; in addition, since the second feature vector characterizes the context semantic vector of the first feature vector, the target feature vector carries context semantics and can therefore be determined accurately.
For example, the phrase "都挺好" may be the title of a TV series or may carry another meaning; without fusing the contextual meaning, it cannot be determined accurately whether "都挺好" is a series title or something else.
S15: perform mapping processing on the at least one target feature vector to obtain at least one probability vector.
In at least one embodiment of this application, the at least one probability vector refers to the probabilities corresponding to the at least one target feature vector; each probability vector has N dimensions, where N is a positive integer greater than or equal to 2. In addition, the probabilities of all dimensions in each probability vector sum to 1.
In at least one embodiment of this application, the electronic device performing mapping processing on the at least one target feature vector to obtain at least one probability vector includes:
the electronic device multiplies each of the at least one target feature vector by a preset weight matrix and adds a preset bias value to obtain at least one score vector; further, the electronic device normalizes the at least one score vector to obtain the at least one probability vector.
The values of the preset weight matrix and of the preset bias value are obtained through repeated training; this application imposes no limitation here.
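The mapping of S15 amounts to a linear transform followed by normalization; in this sketch, a random weight matrix and bias stand in for the trained preset values, and softmax is used for the normalization.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 8, 4                   # target feature dim, number of label dimensions (N >= 2)
W = rng.normal(size=(N, D))   # preset weight matrix (trained in practice)
b = rng.normal(size=N)        # preset bias value

def to_probability_vector(target_feature_vector):
    scores = W @ target_feature_vector + b  # score vector
    exp = np.exp(scores - scores.max())     # normalization (softmax)
    return exp / exp.sum()                  # all N dimensions sum to 1

p = to_probability_vector(rng.normal(size=D))
print(p, p.sum())  # a probability vector over N dimensions, summing to 1
```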
S16: determine the label of the news text according to the at least one probability vector.
In at least one embodiment of this application, the electronic device determining the label of the news text according to the at least one probability vector includes:
the electronic device determines, from the tagging instruction, the target field to which the news text belongs, where the information carried in the tagging instruction includes the target field; further, the electronic device determines, from a configuration library, the target dictionary corresponding to the target field, where the configuration library stores mappings between multiple fields and multiple dictionaries; for the at least one probability vector, the electronic device determines the dimension with the highest probability in each probability vector as a target dimension, obtaining at least one target dimension of the at least one probability vector; the electronic device then determines the categories corresponding to the at least one target dimension in the target dictionary as the label of the news text.
Determining the label makes it easy for users to filter out news texts carrying certain labels, thereby meeting user needs.
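A sketch of this label determination: take the highest-probability dimension of each probability vector and map it through the dictionary of the news text's target field; the configuration library contents are illustrative assumptions.

```python
import numpy as np

# Hypothetical configuration library: field -> dictionary mapping dimension index to category.
CONFIG_LIBRARY = {
    "entertainment": {0: "film", 1: "music", 2: "television", 3: "celebrity"},
    "technology":    {0: "AI", 1: "hardware", 2: "internet", 3: "software"},
}

def determine_labels(probability_vectors, target_field):
    target_dict = CONFIG_LIBRARY[target_field]
    target_dims = [int(np.argmax(p)) for p in probability_vectors]  # highest-probability dims
    return sorted({target_dict[d] for d in target_dims})            # categories become labels

vectors = [np.array([0.1, 0.2, 0.6, 0.1]), np.array([0.7, 0.1, 0.1, 0.1])]
print(determine_labels(vectors, "entertainment"))  # ['film', 'television']
```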
In at least one embodiment of this application, after determining the label of the news text according to the at least one probability vector, the method further includes:
the electronic device obtains the text number of the news text from the tagging instruction; further, the electronic device generates prompt information according to the text number and the label, and encrypts the prompt information with a symmetric encryption technique to obtain a ciphertext; still further, the electronic device sends the ciphertext to the terminal device of a designated contact.
Through the above implementation, the prompt information can be encrypted quickly, the label of the news text is protected from tampering, and the security of the prompt information is improved.
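A sketch of this post-labeling step using one real symmetric scheme (Fernet from the Python `cryptography` package, which is AES-based); the prompt contents and the transport to the contact's terminal device are placeholders.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # shared symmetric key: the same key encrypts and decrypts
cipher = Fernet(key)

prompt = "text_no=200; label=television"       # prompt information: text number + label
ciphertext = cipher.encrypt(prompt.encode("utf-8"))

# ... send `ciphertext` to the designated contact's terminal device ...

assert cipher.decrypt(ciphertext).decode("utf-8") == prompt  # key holder recovers it
```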
It can be seen from the above technical solutions that, by fusing the first feature vectors and the second feature vectors, this application can obtain accurate target feature vectors, thereby improving the accuracy of the labels. In addition, determining the labels not only makes it easy for users to filter out news texts carrying certain labels, but also allows users to understand the content of a news text before reading it.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. For those of ordinary skill in the art, improvements can be made without departing from the inventive concept of this application, and such improvements all fall within the protection scope of this application.
FIG. 2 is a functional module diagram of a preferred embodiment of the label establishment apparatus of this application. The label establishment apparatus 11 includes an extraction unit 110, a preprocessing unit 111, an encoding unit 112, an input unit 113, a processing unit 114, a determination unit 115, an acquisition unit 116, a division unit 117, an adjustment unit 118, a generation unit 119, an encryption unit 120, a sending unit 121 and a training unit 122. A module/unit referred to in this application is a series of computer program segments that can be executed by the processor 13, can complete fixed functions, and are stored in the memory 12. In this embodiment, the functions of the modules/units are detailed in the subsequent embodiments.
When a tagging instruction is received, the extraction unit 110 extracts a news text from the tagging instruction.
In at least one embodiment of this application, the content of the tagging instruction may include, but is not limited to: the news text, a text number of the news text, and the like. The field to which the news text belongs may include, but is not limited to: entertainment, education, technology, and the like.
In at least one embodiment of this application, the tagging instruction may be triggered by a user (for example, via a preset function key), or may be triggered automatically at a preset time; this application imposes no limitation.
The preset time may be a point in time (for example, 9 a.m. every day) or a time period.
In at least one embodiment of this application, the extraction unit 110 determines a target tag from the tagging instruction, and further extracts, from all the information carried in the tagging instruction, the information corresponding to the target tag as the news text.
The target tag is the tag that corresponds to the news text within the tagging instruction.
For example, tagging instruction A is "标签1:文本编号200;标签2:成立于1881年的甲大学是世界著名的大学". The extraction unit 110 determines that the target tag is Tag 2 (标签2), and further extracts the text corresponding to Tag 2, "成立于1881年的甲大学是世界著名的大学", as the news text.
The preprocessing unit 111 preprocesses the news text to obtain at least one word segment.
In at least one embodiment of this application, the at least one word segment refers to the segments obtained by splitting the news text; in addition, the at least one word segment may include TOKEN tags.
The TOKEN tags include, but are not limited to: time, contact phone number, URL, other numbers, and the like.
In at least one embodiment of this application, the preprocessing unit 111 preprocessing the news text to obtain at least one word segment includes:
the preprocessing unit 111 filters configured characters out of the news text to obtain a first text; further, the preprocessing unit 111 performs lexical analysis on preset fields in the first text to obtain a second text, splits the second text according to a preset dictionary to obtain split positions, and constructs a directed acyclic graph (DAG) according to the second text and the split positions; still further, the preprocessing unit 111 calculates the probability of each path in the DAG according to the weights in the preset dictionary, determines the split positions corresponding to the path with the highest probability as target split positions, and determines the at least one word segment according to the target split positions.
The configured characters include, but are not limited to: emoticons, symbol patterns, and the like.
Further, the preset fields include, but are not limited to: time, contact information, URL, and the like.
Still further, the preset dictionary stores at least one custom word and a weight corresponding to each custom word, where the at least one custom word may include, but is not limited to, new Internet words and the like.
Filtering the configured characters not only saves the memory of the electronic device but also saves the time for processing the news text, thereby improving the tagging efficiency; performing lexical analysis on the preset fields avoids unnecessary disturbance in the subsequent extraction of context features; and splitting the second text with a weighted preset dictionary makes it possible to determine the at least one word segment accurately.
In at least one embodiment of this application, the preprocessing unit 111 performing lexical analysis on the preset fields in the first text to obtain the second text includes:
the preprocessing unit 111 replaces the preset fields in the first text with the TOKEN tags to obtain the second text.
Specifically, the preprocessing unit 111 uses shallow semantic analysis to determine the type to which a preset field belongs; further, the preprocessing unit 111 determines, from the TOKEN tags, an identifier matching that type, and replaces the preset field in the first text with the identifier to obtain the second text.
The encoding unit 112 encodes the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment.
In at least one embodiment of this application, the at least one first feature vector refers to the vectors obtained by encoding the at least one word segment.
In at least one embodiment of this application, the encoding unit 112 may use one-hot encoding to encode the at least one word segment.
Specifically, the encoding unit 112 represents each of the at least one word segment with a binary code, where exactly one bit of the binary code representing each word segment is 1 and all other bits are 0.
For example, the segment "英国" (UK) is represented by the binary code "001", the segment "大学" (university) by "010", and the segment "世界" (world) by "100".
Naturally, the more words are one-hot encoded, the higher the dimensionality of the at least one first feature vector.
For example, if 3 words are one-hot encoded, the resulting first feature vectors have 3 dimensions, e.g. "今天" encoded as 001, "天气" as 010, and "真好" as 100; if 5 words are one-hot encoded, the resulting first feature vectors have 5 dimensions, e.g. "今" as 00001, "天" as 00010, "真" as 00100, "晴" as 01000, and "朗" as 10000.
Through the above implementation, converting the at least one word segment into the at least one first feature vector with one-hot encoding not only guarantees a one-to-one correspondence between the word segments and the first feature vectors, but also represents the word segments intuitively; moreover, the conversion facilitates the subsequent context feature extraction on each first feature vector.
The extraction unit 110 performs context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector.
In at least one embodiment of this application, the second feature vector characterizes the context semantic vector of the first feature vector.
In at least one embodiment of this application, the extraction unit 110 performing context feature extraction on each first feature vector among the at least one first feature vector to obtain the second feature vector corresponding to each first feature vector includes:
the extraction unit 110 receives a configured number of vectors; for each first feature vector, the extraction unit 110 determines, according to the at least one first feature vector and the number of vectors, a context feature vector set corresponding to that first feature vector; further, the extraction unit 110 multiplies each feature vector in the context feature vector set by a first preset matrix and calculates the average of the multiplied vectors to obtain an intermediate vector; the extraction unit 110 multiplies the intermediate vector by a second preset matrix to obtain a target matrix, where each column vector of the target matrix characterizes the vector corresponding to a word; still further, the extraction unit 110 calculates the probability of each word in the target matrix with an activation function, and determines the vector corresponding to the word with the highest probability as the second feature vector.
The number of vectors may be configured according to user needs; this application imposes no limitation on its value. The values of the first preset matrix and of the second preset matrix are obtained by repeated training on data in a corpus; the specific training method is prior art and is not described here.
The input unit 113 inputs each first feature vector and each second feature vector into the pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector.
In at least one embodiment of this application, the at least one target feature vector is obtained by fusing the first feature vectors and the second feature vectors.
In at least one embodiment of this application, before each first feature vector and each second feature vector are input into the pre-trained target model to obtain the target feature vector corresponding to each first feature vector, the acquisition unit 116 acquires historical data with web crawler technology; further, the processing unit 114 inputs the historical data into a forget gate layer for forgetting processing to obtain training data, where each item of the training data includes a first input vector, a second input vector and a known output vector; still further, the division unit 117 divides the training data into a training set and a validation set by cross-validation, and the training unit 122 trains a learner based on the first input vectors, second input vectors and known output vectors in the training set; the input unit 113 inputs the first input vectors and second input vectors of the validation set into the learner to obtain output vectors under test, and compares the output vectors under test with the known output vectors; still further, when the output vectors under test are inconsistent with the known output vectors, the adjustment unit 118 adjusts the learner according to the first input vectors, second input vectors and known output vectors in the validation set to obtain the target model.
In at least one embodiment of this application, the adjustment unit 118 adjusting the learner according to the data in the validation set to obtain the target model includes:
the adjustment unit 118 uses a hyperparameter grid search method to determine an optimal hyperparameter point from the data in the validation set; further, the adjustment unit 118 adjusts the learner with the optimal hyperparameter point to obtain the target model.
Specifically, the adjustment unit 118 splits the validation set by a fixed step size to obtain target subsets, traverses the data at the two endpoints of each target subset, validates the learner with the endpoint data to obtain the learning rate of each data point, and determines the data with the best learning rate as a first hyperparameter point; then, within the neighborhood of the first hyperparameter point, it shrinks the step size and continues the traversal until the step size equals a preset step size, at which point the hyperparameter point obtained is the optimal hyperparameter point. Still further, the adjustment unit 118 adjusts the learner according to the optimal hyperparameter point to obtain the target model.
This application imposes no limitation on the preset step size.
By adjusting the learner, an accurate target model can be obtained, laying the groundwork for obtaining accurate target feature vectors.
In at least one embodiment of this application, the target feature vector obtained by the input unit 113 inputting each first feature vector and each second feature vector into the pre-trained target model is the result of fusing the first feature vector and the second feature vector; in addition, since the second feature vector characterizes the context semantic vector of the first feature vector, the target feature vector carries context semantics and can therefore be determined accurately.
For example, the phrase "都挺好" may be the title of a TV series or may carry another meaning; without fusing the contextual meaning, it cannot be determined accurately whether "都挺好" is a series title or something else.
The processing unit 114 performs mapping processing on the at least one target feature vector to obtain at least one probability vector.
In at least one embodiment of this application, the at least one probability vector refers to the probabilities corresponding to the at least one target feature vector; each probability vector has N dimensions, where N is a positive integer greater than or equal to 2. In addition, the probabilities of all dimensions in each probability vector sum to 1.
In at least one embodiment of this application, the processing unit 114 performing mapping processing on the at least one target feature vector to obtain at least one probability vector includes:
the processing unit 114 multiplies each of the at least one target feature vector by a preset weight matrix and adds a preset bias value to obtain at least one score vector; further, the processing unit 114 normalizes the at least one score vector to obtain the at least one probability vector.
The values of the preset weight matrix and of the preset bias value are obtained through repeated training; this application imposes no limitation here.
The determination unit 115 determines the label of the news text according to the at least one probability vector.
In at least one embodiment of this application, the determination unit 115 determining the label of the news text according to the at least one probability vector includes:
the determination unit 115 determines, from the tagging instruction, the target field to which the news text belongs, where the information carried in the tagging instruction includes the target field; further, the determination unit 115 determines, from a configuration library, the target dictionary corresponding to the target field, where the configuration library stores mappings between multiple fields and multiple dictionaries; for the at least one probability vector, the determination unit 115 determines the dimension with the highest probability in each probability vector as a target dimension, obtaining at least one target dimension of the at least one probability vector; the determination unit 115 then determines the categories corresponding to the at least one target dimension in the target dictionary as the label of the news text.
Determining the label makes it easy for users to filter out news texts carrying certain labels, thereby meeting user needs.
In at least one embodiment of this application, after the label of the news text is determined according to the at least one probability vector, the acquisition unit 116 obtains the text number of the news text from the tagging instruction; further, the generation unit 119 generates prompt information according to the text number and the label, and the encryption unit 120 encrypts the prompt information with a symmetric encryption technique to obtain a ciphertext; still further, the sending unit 121 sends the ciphertext to the terminal device of a designated contact.
Through the above implementation, the prompt information can be encrypted quickly, the label of the news text is protected from tampering, and the security of the prompt information is improved.
It can be seen from the above technical solutions that, by fusing the first feature vectors and the second feature vectors, this application can obtain accurate target feature vectors, thereby improving the accuracy of the labels. In addition, determining the labels not only makes it easy for users to filter out news texts carrying certain labels, but also allows users to understand the content of a news text before reading it.
FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the label establishment method of this application.
In one embodiment of this application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and executable on the processor 13, for example a label establishment program.
Those skilled in the art will understand that the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1; the electronic device 1 may include more or fewer components than shown, combine certain components, or have different components; for example, the electronic device 1 may also include input/output devices, network access devices, buses, and the like.
The processor 13 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it connects all parts of the entire electronic device 1 with various interfaces and lines, and executes the operating system of the electronic device 1 as well as the installed applications, program code, and the like.
The processor 13 executes the operating system of the electronic device 1 and the installed applications. The processor 13 executes the applications to implement the steps in the above label establishment method embodiments, for example the steps shown in FIG. 1.
Exemplarily, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions; the instruction segments describe the execution process of the computer program in the electronic device 1. For example, the computer program may be divided into an extraction unit 110, a preprocessing unit 111, an encoding unit 112, an input unit 113, a processing unit 114, a determination unit 115, an acquisition unit 116, a division unit 117, an adjustment unit 118, a generation unit 119, an encryption unit 120, a sending unit 121 and a training unit 122.
The memory 12 may be configured to store the computer program and/or modules. The processor 13 implements the various functions of the electronic device 1 by running or executing the computer programs and/or modules stored in the memory 12 and invoking the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and an application required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the electronic device. In addition, the memory 12 may include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The computer-readable storage medium stores computer-readable instructions, and the processor executes the computer-readable instructions to further implement the following steps: when a tagging instruction is received, extracting a news text from the tagging instruction; preprocessing the news text to obtain at least one word segment; encoding the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment; performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector; inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector; performing mapping processing on the at least one target feature vector to obtain at least one probability vector; and determining a label of the news text according to the at least one probability vector.
In said preprocessing the news text to obtain at least one word segment, the at least one computer-readable instruction is executed by the processor to implement the following steps: filtering configured characters out of the news text to obtain a first text; performing lexical analysis on preset fields in the first text to obtain a second text; splitting the second text according to a preset dictionary to obtain split positions; constructing a directed acyclic graph according to the second text and the split positions; calculating the probability of each path in the directed acyclic graph according to the weights in the preset dictionary; determining the split positions corresponding to the path with the highest probability as target split positions; and determining the at least one word segment according to the target split positions.
In said performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector, the at least one computer-readable instruction is executed by the processor to implement the following steps: receiving a configured number of vectors; for each first feature vector, determining, according to the at least one first feature vector and the number of vectors, a context feature vector set corresponding to that first feature vector; multiplying each feature vector in the context feature vector set by a first preset matrix and calculating the average of the multiplied vectors to obtain an intermediate vector; multiplying the intermediate vector by a second preset matrix to obtain a target matrix, each column vector of the target matrix characterizing the vector corresponding to a word; calculating the probability of each word in the target matrix with an activation function; and determining the vector corresponding to the word with the highest probability as the second feature vector.
Before each first feature vector and each second feature vector are input into the pre-trained target model to obtain the at least one target feature vector corresponding to the at least one first feature vector, the at least one computer-readable instruction is executed by the processor to further implement the following steps: acquiring historical data with web crawler technology; inputting the historical data into a forget gate layer for forgetting processing to obtain training data, each item of the training data comprising a first input vector, a second input vector and a known output vector; dividing the training data into a training set and a validation set by cross-validation; training based on the first input vectors, second input vectors and known output vectors in the training set to obtain a learner; inputting the first input vectors and second input vectors of the validation set into the learner to obtain output vectors under test, and comparing the output vectors under test with the known output vectors; and, when the output vectors under test are inconsistent with the known output vectors, adjusting the learner according to the first input vectors, second input vectors and known output vectors in the validation set to obtain the target model.
In said performing mapping processing on the at least one target feature vector to obtain at least one probability vector, the at least one computer-readable instruction is executed by the processor to further implement the following steps: multiplying each of the at least one target feature vector by a preset weight matrix and adding a preset bias value to obtain at least one score vector; and normalizing the at least one score vector to obtain the at least one probability vector.
In said determining the label of the news text according to the at least one probability vector, the at least one computer-readable instruction is executed by the processor to implement the following steps: determining, from the tagging instruction, a target field to which the news text belongs, the information carried in the tagging instruction comprising the target field; determining, from a configuration library, a target dictionary corresponding to the target field, the configuration library storing mappings between multiple fields and multiple dictionaries; for the at least one probability vector, determining the dimension with the highest probability in each probability vector as a target dimension to obtain at least one target dimension of the at least one probability vector; and determining the categories corresponding to the at least one target dimension in the target dictionary as the label of the news text.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in physical form, such as a memory stick or a TF card (Trans-flash Card).
If the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be non-volatile or volatile. Based on this understanding, this application may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, can implement the steps of the above method embodiments.
The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
With reference to FIG. 1, the memory 12 in the electronic device 1 stores multiple instructions to implement a label establishment method, and the processor 13 can execute the multiple instructions to implement: when a tagging instruction is received, extracting a news text from the tagging instruction; preprocessing the news text to obtain at least one word segment; encoding the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment; performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector; inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector; performing mapping processing on the at least one target feature vector to obtain at least one probability vector; and determining a label of the news text according to the at least one probability vector.
Specifically, for the specific implementation of the above instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; for example, the division of the modules is only a division by logical function, and there may be other division methods in actual implementation.
The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
Therefore, no matter from which point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description; therefore, all changes falling within the meaning and scope of equivalent elements of the claims are intended to be included in this application. Any reference sign in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is clear that the word "comprise" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses stated in the system claims may also be implemented by one unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application can be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.

Claims (20)

  1. A label establishment method, wherein the label establishment method comprises:
    when a tagging instruction is received, extracting a news text from the tagging instruction;
    preprocessing the news text to obtain at least one word segment;
    encoding the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment;
    performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
    inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector;
    performing mapping processing on the at least one target feature vector to obtain at least one probability vector; and
    determining a label of the news text according to the at least one probability vector.
  2. The label establishment method according to claim 1, wherein said preprocessing the news text to obtain at least one word segment comprises:
    filtering configured characters out of the news text to obtain a first text;
    performing lexical analysis on preset fields in the first text to obtain a second text;
    splitting the second text according to a preset dictionary to obtain split positions;
    constructing a directed acyclic graph according to the second text and the split positions;
    calculating the probability of each path in the directed acyclic graph according to weights in the preset dictionary;
    determining the split positions corresponding to the path with the highest probability as target split positions; and
    determining the at least one word segment according to the target split positions.
  3. The label establishment method according to claim 1, wherein said performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector comprises:
    receiving a configured number of vectors;
    for each first feature vector, determining, according to the at least one first feature vector and the number of vectors, a context feature vector set corresponding to that first feature vector;
    multiplying each feature vector in the context feature vector set by a first preset matrix and calculating the average of the multiplied vectors to obtain an intermediate vector;
    multiplying the intermediate vector by a second preset matrix to obtain a target matrix, each column vector of the target matrix characterizing the vector corresponding to a word;
    calculating the probability of each word in the target matrix with an activation function; and
    determining the vector corresponding to the word with the highest probability as the second feature vector.
  4. The label establishment method according to claim 1, wherein, before inputting each first feature vector and each second feature vector into the pre-trained target model to obtain the at least one target feature vector corresponding to the at least one first feature vector, the label establishment method further comprises:
    acquiring historical data with web crawler technology;
    inputting the historical data into a forget gate layer for forgetting processing to obtain training data, each item of the training data comprising a first input vector, a second input vector and a known output vector;
    dividing the training data into a training set and a validation set by cross-validation;
    training based on the first input vectors, second input vectors and known output vectors in the training set to obtain a learner;
    inputting the first input vectors and second input vectors of the validation set into the learner to obtain output vectors under test, and comparing the output vectors under test with the known output vectors; and
    when the output vectors under test are inconsistent with the known output vectors, adjusting the learner according to the first input vectors, second input vectors and known output vectors in the validation set to obtain the target model.
  5. The label establishment method according to claim 1, wherein said performing mapping processing on the at least one target feature vector to obtain at least one probability vector comprises:
    multiplying each of the at least one target feature vector by a preset weight matrix and adding a preset bias value to obtain at least one score vector; and
    normalizing the at least one score vector to obtain the at least one probability vector.
  6. The label establishment method according to claim 1, wherein said determining the label of the news text according to the at least one probability vector comprises:
    determining, from the tagging instruction, a target field to which the news text belongs, the information carried in the tagging instruction comprising the target field;
    determining, from a configuration library, a target dictionary corresponding to the target field, the configuration library storing mappings between multiple fields and multiple dictionaries;
    for the at least one probability vector, determining the dimension with the highest probability in each probability vector as a target dimension to obtain at least one target dimension of the at least one probability vector; and
    determining the categories corresponding to the at least one target dimension in the target dictionary as the label of the news text.
  7. The label establishment method according to claim 1, wherein, after determining the label of the news text according to the at least one probability vector, the label establishment method further comprises:
    obtaining a text number of the news text from the tagging instruction;
    generating prompt information according to the text number and the label;
    encrypting the prompt information with a symmetric encryption technique to obtain a ciphertext; and
    sending the ciphertext to a terminal device of a designated contact.
  8. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    when a tagging instruction is received, extracting a news text from the tagging instruction;
    preprocessing the news text to obtain at least one word segment;
    encoding the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment;
    performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
    inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector;
    performing mapping processing on the at least one target feature vector to obtain at least one probability vector; and
    determining a label of the news text according to the at least one probability vector.
  9. The electronic device according to claim 8, wherein, in said preprocessing the news text to obtain at least one word segment, the processor executes the at least one computer-readable instruction to implement the following steps:
    filtering configured characters out of the news text to obtain a first text;
    performing lexical analysis on preset fields in the first text to obtain a second text;
    splitting the second text according to a preset dictionary to obtain split positions;
    constructing a directed acyclic graph according to the second text and the split positions;
    calculating the probability of each path in the directed acyclic graph according to weights in the preset dictionary;
    determining the split positions corresponding to the path with the highest probability as target split positions; and
    determining the at least one word segment according to the target split positions.
  10. The electronic device according to claim 8, wherein, in said performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector, the processor executes the at least one computer-readable instruction to implement the following steps:
    receiving a configured number of vectors;
    for each first feature vector, determining, according to the at least one first feature vector and the number of vectors, a context feature vector set corresponding to that first feature vector;
    multiplying each feature vector in the context feature vector set by a first preset matrix and calculating the average of the multiplied vectors to obtain an intermediate vector;
    multiplying the intermediate vector by a second preset matrix to obtain a target matrix, each column vector of the target matrix characterizing the vector corresponding to a word;
    calculating the probability of each word in the target matrix with an activation function; and
    determining the vector corresponding to the word with the highest probability as the second feature vector.
  11. The electronic device according to claim 8, wherein, before each first feature vector and each second feature vector are input into the pre-trained target model to obtain the at least one target feature vector corresponding to the at least one first feature vector, the processor executes the at least one computer-readable instruction to further implement the following steps:
    acquiring historical data with web crawler technology;
    inputting the historical data into a forget gate layer for forgetting processing to obtain training data, each item of the training data comprising a first input vector, a second input vector and a known output vector;
    dividing the training data into a training set and a validation set by cross-validation;
    training based on the first input vectors, second input vectors and known output vectors in the training set to obtain a learner;
    inputting the first input vectors and second input vectors of the validation set into the learner to obtain output vectors under test, and comparing the output vectors under test with the known output vectors; and
    when the output vectors under test are inconsistent with the known output vectors, adjusting the learner according to the first input vectors, second input vectors and known output vectors in the validation set to obtain the target model.
  12. The electronic device according to claim 8, wherein, in said performing mapping processing on the at least one target feature vector, the processor executes the at least one computer-readable instruction to implement the following steps:
    multiplying each of the at least one target feature vector by a preset weight matrix and adding a preset bias value to obtain at least one score vector; and
    normalizing the at least one score vector to obtain the at least one probability vector.
  13. The electronic device according to claim 8, wherein, in said determining the label of the news text according to the at least one probability vector, the processor executes the at least one computer-readable instruction to implement the following steps:
    determining, from the tagging instruction, a target field to which the news text belongs, the information carried in the tagging instruction comprising the target field;
    determining, from a configuration library, a target dictionary corresponding to the target field, the configuration library storing mappings between multiple fields and multiple dictionaries;
    for the at least one probability vector, determining the dimension with the highest probability in each probability vector as a target dimension to obtain at least one target dimension of the at least one probability vector; and
    determining the categories corresponding to the at least one target dimension in the target dictionary as the label of the news text.
  14. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction which, when executed by a processor, implements the following steps:
    when a tagging instruction is received, extracting a news text from the tagging instruction;
    preprocessing the news text to obtain at least one word segment;
    encoding the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment;
    performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
    inputting each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector;
    performing mapping processing on the at least one target feature vector to obtain at least one probability vector; and
    determining a label of the news text according to the at least one probability vector.
  15. The storage medium according to claim 14, wherein, in said preprocessing the news text to obtain at least one word segment, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    filtering configured characters out of the news text to obtain a first text;
    performing lexical analysis on preset fields in the first text to obtain a second text;
    splitting the second text according to a preset dictionary to obtain split positions;
    constructing a directed acyclic graph according to the second text and the split positions;
    calculating the probability of each path in the directed acyclic graph according to weights in the preset dictionary;
    determining the split positions corresponding to the path with the highest probability as target split positions; and
    determining the at least one word segment according to the target split positions.
  16. The storage medium according to claim 14, wherein, in said performing context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    receiving a configured number of vectors;
    for each first feature vector, determining, according to the at least one first feature vector and the number of vectors, a context feature vector set corresponding to that first feature vector;
    multiplying each feature vector in the context feature vector set by a first preset matrix and calculating the average of the multiplied vectors to obtain an intermediate vector;
    multiplying the intermediate vector by a second preset matrix to obtain a target matrix, each column vector of the target matrix characterizing the vector corresponding to a word;
    calculating the probability of each word in the target matrix with an activation function; and
    determining the vector corresponding to the word with the highest probability as the second feature vector.
  17. The storage medium according to claim 14, wherein, before each first feature vector and each second feature vector are input into the pre-trained target model to obtain the at least one target feature vector corresponding to the at least one first feature vector, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    acquiring historical data with web crawler technology;
    inputting the historical data into a forget gate layer for forgetting processing to obtain training data, each item of the training data comprising a first input vector, a second input vector and a known output vector;
    dividing the training data into a training set and a validation set by cross-validation;
    training based on the first input vectors, second input vectors and known output vectors in the training set to obtain a learner;
    inputting the first input vectors and second input vectors of the validation set into the learner to obtain output vectors under test, and comparing the output vectors under test with the known output vectors; and
    when the output vectors under test are inconsistent with the known output vectors, adjusting the learner according to the first input vectors, second input vectors and known output vectors in the validation set to obtain the target model.
  18. The storage medium according to claim 14, wherein, in said performing mapping processing on the at least one target feature vector to obtain at least one probability vector, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    multiplying each of the at least one target feature vector by a preset weight matrix and adding a preset bias value to obtain at least one score vector; and
    normalizing the at least one score vector to obtain the at least one probability vector.
  19. The storage medium according to claim 14, wherein, in said determining the label of the news text according to the at least one probability vector, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    determining, from the tagging instruction, a target field to which the news text belongs, the information carried in the tagging instruction comprising the target field;
    determining, from a configuration library, a target dictionary corresponding to the target field, the configuration library storing mappings between multiple fields and multiple dictionaries;
    for the at least one probability vector, determining the dimension with the highest probability in each probability vector as a target dimension to obtain at least one target dimension of the at least one probability vector; and
    determining the categories corresponding to the at least one target dimension in the target dictionary as the label of the news text.
  20. A label establishment apparatus, wherein the label establishment apparatus comprises:
    an extraction unit, configured to extract a news text from a tagging instruction when the tagging instruction is received;
    a preprocessing unit, configured to preprocess the news text to obtain at least one word segment;
    an encoding unit, configured to encode the at least one word segment to obtain at least one first feature vector corresponding to the at least one word segment;
    the extraction unit being further configured to perform context feature extraction on each first feature vector among the at least one first feature vector to obtain a second feature vector corresponding to each first feature vector;
    an input unit, configured to input each first feature vector and each second feature vector into a pre-trained target model to obtain at least one target feature vector corresponding to the at least one first feature vector, each target feature vector being formed by concatenating a coding vector and a position vector, each coding vector being formed by concatenating each first feature vector and each second feature vector, and each position vector being determined according to the position of each first feature vector and the position of each second feature vector;
    a processing unit, configured to perform mapping processing on the at least one target feature vector to obtain at least one probability vector; and
    a determination unit, configured to determine a label of the news text according to the at least one probability vector.
PCT/CN2020/105633 2020-03-31 2020-07-29 Label establishment method, apparatus, electronic device and medium WO2021196468A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010243203.9A CN111553148A (zh) Label establishment method, apparatus, electronic device and medium
CN202010243203.9 2020-03-31

Publications (1)

Publication Number Publication Date
WO2021196468A1 true WO2021196468A1 (zh) 2021-10-07

Family

ID=72005512

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105633 WO2021196468A1 (zh) Label establishment method, apparatus, electronic device and medium

Country Status (2)

Country Link
CN (1) CN111553148A (zh)
WO (1) WO2021196468A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507120B (zh) * 2021-02-07 2021-06-04 上海二三四五网络科技有限公司 Prediction method and apparatus for maintaining classification consistency
CN113268614B (zh) * 2021-05-25 2024-06-04 平安银行股份有限公司 Label system updating method and apparatus, electronic device, and readable storage medium
CN113204698B (zh) * 2021-05-31 2023-12-26 平安科技(深圳)有限公司 News subject term generation method and apparatus, device, and medium
CN113342977B (zh) * 2021-06-22 2022-10-28 深圳壹账通智能科技有限公司 *** image classification method and apparatus, device, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180349350A1 (en) * 2017-06-01 2018-12-06 Beijing Baidu Netcom Science And Technology Co., Ltd. Artificial intelligence based method and apparatus for checking text
CN109710922A (zh) * 2018-12-06 2019-05-03 深港产学研基地产业发展中心 Text recognition method and apparatus, computer device, and storage medium
CN110287323A (zh) * 2019-06-27 2019-09-27 成都冰鉴信息科技有限公司 Target-oriented sentiment classification method
CN110399488A (zh) * 2019-07-05 2019-11-01 深圳和而泰家居在线网络科技有限公司 Text classification method and apparatus
CN110705206A (zh) * 2019-09-23 2020-01-17 腾讯科技(深圳)有限公司 Text information processing method and related apparatus

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091458A (zh) * 2021-11-12 2022-02-25 北京明略软件系统有限公司 Entity recognition method and system based on model fusion
CN114254075A (zh) * 2021-12-13 2022-03-29 北京惠及智医科技有限公司 Label recognition method and apparatus, electronic device, and storage medium
CN114386421A (zh) * 2022-01-13 2022-04-22 平安科技(深圳)有限公司 Similar news detection method and apparatus, computer device, and storage medium
CN114862141A (zh) * 2022-04-20 2022-08-05 平安科技(深圳)有限公司 Course recommendation method and apparatus based on portrait correlation, device, and storage medium
CN116150698A (zh) * 2022-09-08 2023-05-23 天津大学 DRG automatic grouping method and system based on semantic information fusion
CN116150698B (zh) * 2022-09-08 2023-08-22 天津大学 DRG automatic grouping method and system based on semantic information fusion
CN118069777A (zh) * 2023-12-27 2024-05-24 伟金投资有限公司 Method and system for automatic classification and republication of collected information

Also Published As

Publication number Publication date
CN111553148A (zh) 2020-08-18

Similar Documents

Publication Publication Date Title
WO2021196468A1 (zh) Label establishment method, apparatus, electronic device, and medium
WO2021151299A1 (zh) Artificial intelligence-based data augmentation method and apparatus, electronic device, and medium
WO2022105122A1 (zh) Artificial intelligence-based answer generation method and apparatus, computer device, and medium
CN111695033B (zh) Enterprise public opinion analysis method and apparatus, electronic device, and medium
WO2021174723A1 (zh) Training sample expansion method and apparatus, electronic device, and storage medium
WO2021217931A1 (zh) Classification model-based field extraction method and apparatus, electronic device, and medium
CN111737499B (zh) Natural language processing-based data search method and related device
CN111026319B (zh) Intelligent text processing method and apparatus, electronic device, and storage medium
CN111967242A (zh) Text information extraction method, apparatus, and device
CN111291566B (zh) Event subject recognition method and apparatus, and storage medium
CN113051356B (zh) Open relation extraction method and apparatus, electronic device, and storage medium
WO2022088671A1 (zh) Automatic question answering method and apparatus, device, and storage medium
CN113656547B (zh) Text matching method and apparatus, device, and storage medium
CN111026320B (zh) Multi-modal intelligent text processing method and apparatus, electronic device, and storage medium
WO2021196825A1 (zh) Abstract generation method and apparatus, electronic device, and medium
US20240185602A1 (en) Cross-Modal Processing For Vision And Language
CN109979439B (zh) Blockchain-based speech recognition method and apparatus, medium, and electronic device
CN113836866B (zh) Text encoding method and apparatus, computer-readable medium, and electronic device
CN112053143B (zh) Fund routing method and apparatus, electronic device, and storage medium
CN116151132A (zh) Intelligent code completion method and system for programming learning scenarios, and storage medium
CN113064973A (zh) Text classification method and apparatus, device, and storage medium
CN112749561A (zh) Entity recognition method and device
CN111966811A (zh) Intent recognition and slot filling method and apparatus, readable storage medium, and terminal device
CN111552798B (zh) Name information processing method and apparatus based on a name prediction model, and electronic device
CN112668325B (zh) Machine translation enhancement method and system, terminal, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20929126

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 230123)

122 Ep: pct application non-entry in european phase

Ref document number: 20929126

Country of ref document: EP

Kind code of ref document: A1