CN115618873A - Data processing method and device, computer equipment and storage medium - Google Patents

Data processing method and device, computer equipment and storage medium

Info

Publication number
CN115618873A
Authority
CN
China
Prior art keywords
target
word
entity
entity word
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110793816.4A
Other languages
Chinese (zh)
Inventor
黄婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110793816.4A priority Critical patent/CN115618873A/en
Publication of CN115618873A publication Critical patent/CN115618873A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a data processing method, a data processing device, computer equipment and a storage medium. The method comprises the following steps: acquiring a target text, and acquiring a target entity word from the target text, wherein the target entity word is a user name of a target user; determining an associated entity word of the target entity word from the target text; identifying the target entity word to obtain a first word vector of the target entity word, and identifying the associated entity word to obtain a second word vector of the associated entity word; and performing label prediction processing on the target entity word by combining the first word vector and the second word vector to obtain a prediction label of the target entity word, wherein the prediction label is used for indicating whether the user identity of the target user is the target identity. In this way, the user identity corresponding to an entity word serving as a user name can be mined and predicted based on the vector representation of the entity word.

Description

Data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
With the continued development of internet technology, natural language processing technology has also advanced, and it can effectively improve the efficiency with which text is processed. Named Entity Recognition (NER) is one of the natural language techniques currently in wide use. NER can recognize proper nouns such as person names, place names and organization names in an input text; that is, NER can only determine the type of each entity word in the input text, and cannot further mine other related information described by the entity words. How to mine such related information has therefore become a current research hotspot.
Disclosure of Invention
The embodiment of the invention provides a data processing method, a data processing device, computer equipment and a storage medium, which can realize mining and prediction of a user identity corresponding to an entity word serving as a user name based on vector representation of the entity word.
In one aspect, an embodiment of the present invention provides a data processing method, including:
acquiring a target text, and acquiring a target entity word from the target text, wherein the target entity word is a user name of a target user;
determining an associated entity word of the target entity word from the target text, wherein the associated entity word is any one or more other entity words except the target entity word in the target text, and the associated entity word comprises a descriptor related to the identity characteristics of the target user;
identifying the target entity words to obtain first word vectors of the target entity words, and identifying the associated entity words to obtain second word vectors of the associated entity words;
and performing label prediction processing on the target entity word by combining the first word vector and the second word vector to obtain a prediction label of the target entity word, wherein the prediction label is used for indicating whether the user identity of the target user is the target identity.
In another aspect, an embodiment of the present invention provides a data processing apparatus, including:
an acquisition unit, configured to acquire a target text and acquire a target entity word from the target text, where the target entity word is a user name of a target user;
a determining unit, configured to determine, from the target text, an associated entity word of the target entity word, where the associated entity word is any one or more other entity words except the target entity word in the target text, and the associated entity word includes a descriptor related to an identity feature of the target user;
a processing unit, configured to identify the target entity word to obtain a first word vector of the target entity word, and identify the associated entity word to obtain a second word vector of the associated entity word;
the processing unit is further configured to perform label prediction processing on the target entity word in combination with the first word vector and the second word vector to obtain a prediction label of the target entity word, where the prediction label is used to indicate whether the user identity of the target user is a target identity.
In another aspect, an embodiment of the present invention provides a computer device, including a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, where the memory is used to store a computer program that supports the computer device to execute the above method, the computer program includes program instructions, and the processor is configured to call the program instructions to perform the following steps:
acquiring a target text, and acquiring a target entity word from the target text, wherein the target entity word is a user name of a target user;
determining associated entity words of the target entity words from the target text, wherein the associated entity words are any one or more other entity words except the target entity words in the target text, and the associated entity words comprise descriptors related to the identity characteristics of the target user;
identifying the target entity words to obtain first word vectors of the target entity words, and identifying the associated entity words to obtain second word vectors of the associated entity words;
and performing label prediction processing on the target entity word by combining the first word vector and the second word vector to obtain a prediction label of the target entity word, wherein the prediction label is used for indicating whether the user identity of the target user is the target identity.
In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium, in which program instructions are stored, and when the program instructions are executed by a processor, the program instructions are used to execute the data processing method according to the first aspect.
In the embodiment of the present invention, after obtaining the target text, the computer device may obtain the user name of the target user from the target text and use it as the target entity word. After determining the target entity word, the computer device may further determine an associated entity word of the target entity word from the target text. By acquiring the target entity word and the associated entity word from the target text, the computer device obtains entity words with non-coherent semantics. The computer device may then perform recognition processing on the target entity word to obtain a first word vector, and perform recognition processing on the associated entity word to obtain a second word vector. Combining the first word vector and the second word vector, the computer device may predict a prediction label of the target entity word, and determine from the prediction label whether the target user indicated by the target entity word has the target identity. In this way, the computer device determines the label of the target entity word (that is, the refined entity type of the target entity word) based on entity words with non-coherent semantics. Performing label prediction with such entity words makes it more convenient for the computer device to predict entity word labels, and thereby improves the efficiency of entity word label prediction.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1a is a schematic structural diagram of a target network model according to an embodiment of the present invention;
FIG. 1b is a schematic structural diagram of a target network model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram of a data processing method provided by an embodiment of the invention;
FIG. 3 is a schematic flow chart diagram of a data processing method provided by an embodiment of the invention;
FIG. 4 is a schematic diagram of optimization training of a target network model according to an embodiment of the present invention;
FIG. 5a is a diagram illustrating an intent search based on a search vocabulary according to an embodiment of the present invention;
FIG. 5b is a schematic diagram of a scenario for performing an intent search according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of a data processing apparatus provided by an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a data processing method in which a computer device uses an acquired user name of a target user as a target entity word, and uses an entity word having a context relation with that user name as an associated entity word of the target entity word. After acquiring the target entity word and the associated entity word, the computer device may perform recognition processing on each of them to obtain a first word vector corresponding to the target entity word and a second word vector corresponding to the associated entity word. In one embodiment, the context includes the preceding context and the following context. The entity words having a context relation with a given entity word (such as the target entity word) are therefore those in the preceding and following context within the text in which the entity word is located (that is, the target text), where the preceding (or following) context refers to the entity words before (or after) the target entity word in the target text. An entity word is a word that has a specific meaning or refers to a real object. It can thus be understood that the associated entity words of the target entity word include the other words in the text where the target entity word is located, excluding the target entity word itself and excluding pronouns, stop words, auxiliary words and the like, which have no specific meaning or cannot refer to a real object. For example, if the target text is "small a commentary brilliant game video" and the target entity word is "small a", the associated entity words determined based on the context relation include "commentary" and "game video", and do not include the "brilliant" auxiliary word, which has no specific meaning.
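The context relation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `split_context`, the token list, and the contents of `NO_MEANING_WORDS` are all hypothetical, since the text names the excluded categories (pronouns, stop words, auxiliary words) but gives no concrete list.

```python
from typing import List, Set, Tuple

# Hypothetical list of words "with no specific meaning"; the source names the
# categories (pronouns, stop words, auxiliary words) but gives no concrete list.
NO_MEANING_WORDS: Set[str] = {"the", "of", "a", "by", "it"}


def split_context(tokens: List[str], target: str) -> Tuple[List[str], List[str]]:
    """Return the entity words before and after the first occurrence of target."""
    idx = tokens.index(target)  # raises ValueError if the target is absent

    def is_entity_word(word: str) -> bool:
        # Keep every word except the target itself and the no-meaning words.
        return word != target and word.lower() not in NO_MEANING_WORDS

    preceding = [w for w in tokens[:idx] if is_entity_word(w)]
    following = [w for w in tokens[idx + 1:] if is_entity_word(w)]
    return preceding, following


# Hypothetical, already segmented text with the user name "UserA".
tokens = ["the", "video", "of", "UserA", "commentary", "game"]
preceding, following = split_context(tokens, "UserA")
associated = preceding + following  # candidate associated entity words
print(associated)
```

Here the associated entity words are simply the preceding and following context with the no-meaning words removed; a real system would rely on the segmentation and NER services described later in the text.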
Furthermore, entity words with non-coherent semantics are distinguished from entity words with coherent semantics: entity words with coherent semantics consist of all of the words obtained by segmenting the target text, whereas entity words with non-coherent semantics are only some of the words obtained by segmenting the target text.
In an embodiment, the computer device may perform recognition processing on the target entity word and the associated entity word by invoking a target network model, so as to obtain the first word vector of the target entity word and the second word vector of the associated entity word. The target network model is a neural network for discriminating a specific user identity of the target user referred to by the target entity word; that is, the target network model may be used to determine whether the target user has a specific user identity, which may be, for example, a game commentator identity, a live broadcaster identity, a host identity, or the like. The model structure of the target network model may specifically include a word vector generation network, a semantic coding network, and an identity discrimination network. The word vector generation network is configured to convert each character (or word) in an entity word (including the target entity word and the associated entity words of the target entity word) into a fixed-length vector representation. The semantic coding network may be used, after the vector representation of each character (or word) in an entity word has been obtained, to perform semantic analysis on the entity word and to obtain the vector representation of the entity word in combination with the result of that semantic analysis. The identity discrimination network, which may also be called an identity discrimination module, may discriminate the user identity of the target user indicated by the target entity word based on the vector representations of the entity words (including the vector representation of the target entity word and the vector representations of the associated entity words). In one embodiment, the model structure of the target network model may be as shown in FIG. 1a, in which case the target network model performs semantic analysis on both the target entity word and the associated entity word when obtaining the first word vector and the second word vector, so that both word vectors contain the semantics of the corresponding entity word. Alternatively, the model structure of the target network model may be as shown in FIG. 1b, in which case the target network model performs semantic analysis only on the associated entity word, so that the second word vector introduces the semantics of the associated entity word, and does not perform semantic analysis on the target entity word; that is, the first word vector of the target entity word is obtained by splicing the vector representations of the characters in the target entity word.
After the first word vector corresponding to the target entity word and the second word vector of the associated entity word are obtained, the identity discrimination network (that is, the identity discrimination module) may perform label prediction processing on the target entity word based on the first word vector and the second word vector to obtain a prediction label of the target entity word, and the computer device may then obtain an identity prediction result for the target user according to that prediction label. In an embodiment, the target network model may include a plurality of word vector generation networks and a plurality of semantic coding networks; that is, different entity words may use different word vector generation networks to generate the vector representations of their characters, and different semantic coding networks to perform semantic analysis so as to introduce the semantics of the entity words into their vector representations. Alternatively, the target network model may include only one word vector generation network and one semantic coding network, so that after the target entity word and the associated entity word are input into the target network model, the model reuses the same word vector generation network and semantic coding network to obtain the vector representation of the target entity word and the vector representation of the associated entity word.
In addition, an attention mechanism may be introduced when the semantic coding network performs semantic recognition on an entity word. The word vector generation network may be implemented with a word embedding based algorithm or with word2vec (an algorithm for generating the character vectors of an entity word), and the semantic coding network may be implemented with a Long Short-Term Memory network (LSTM), a Bi-directional Long Short-Term Memory network (Bi-LSTM), or another semantic coding network. That is, the embodiment of the present invention does not limit the specific implementation of each network in the target network model.
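The text states that an attention mechanism may be introduced in the semantic coding network but does not say which form it takes. The following is a minimal sketch that assumes simple dot-product attention pooling over per-character vectors; the names `attention_pool` and `query`, and the toy vectors, are hypothetical. In a fuller model the per-character vectors would be the hidden states of an LSTM or Bi-LSTM rather than fixed numbers.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(char_vectors: List[Vector], query: Vector) -> Vector:
    """Weight each character vector by its dot product with the query, then sum."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in char_vectors]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * vec[i] for w, vec in zip(weights, char_vectors)) for i in range(dim)]


# Toy per-character vectors (assumed values) and an all-zero query,
# which gives every character the same attention weight.
char_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]
pooled = attention_pool(char_vectors, query)
print(pooled)
```

With a learned query, characters that matter more for the identity decision would receive larger weights; the zero query above is used only so the result is easy to check by hand.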
In an embodiment, the target text acquired by the computer device may be any training text in a training text set. When the target text is a training text, the computer device may call the target network model to obtain a prediction label of the target entity word, and perform optimization training on the target network model based on the prediction label and the real label corresponding to the target entity word, thereby obtaining a trained target network model. The computer device may then use the trained target network model to predict the label of a target entity word in other texts (that is, texts without a real label), so as to mine and predict, from the predicted label, the user identity of the user corresponding to a target entity word whose label is unknown. This improves the accuracy with which the computer device judges whether the user corresponding to a user name in a text has the target identity. It can be understood that, when the real label of a target entity word is unknown, the computer device can obtain a prediction label from the target network model and then judge, based on that prediction label, whether the target user has the target identity, thereby mining and predicting the user identity corresponding to target entity words in other texts.
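The optimization training described above can be illustrated with a deliberately small stand-in. The sketch below assumes that the difference between the prediction label and the real label is measured with a cross-entropy loss and minimized by gradient descent; the text does not name a loss function or optimizer, so both are assumptions. A single linear layer with a sigmoid output (equivalent to a two-class softmax) replaces the full target network model, and the feature vectors in `data` are made-up stand-ins for the combined first and second word vectors.

```python
import math
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, real label: 1 = target identity)


def predict(weights: List[float], bias: float, x: List[float]) -> float:
    """Confidence that the user has the target identity."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(weights: List[float], bias: float, data: List[Sample]) -> float:
    """Mean cross-entropy between predicted confidences and real labels."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)


def train(data: List[Sample], epochs: int = 200, lr: float = 0.5):
    """Full-batch gradient descent on the cross-entropy loss."""
    dim = len(data[0][0])
    weights, bias = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = predict(weights, bias, x) - y  # prediction minus real label
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Made-up training features and real labels.
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
loss_before = cross_entropy([0.0, 0.0], 0.0, data)
weights, bias = train(data)
loss_after = cross_entropy(weights, bias, data)
print(loss_before, loss_after)
```

After training, the same `predict` function can be applied to feature vectors whose real label is unknown, which mirrors the use of the trained target network model on texts without real labels.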
Referring to FIG. 2, which is a schematic flowchart of a data processing method according to an embodiment of the present invention, the data processing method may be executed by the computer device described above and, as shown in FIG. 2, may include:
S201, obtaining a target text, and obtaining a target entity word from the target text, wherein the target entity word is a user name of a target user.
S202, determining the associated entity words of the target entity words from the target text, wherein the associated entity words are any one or more other entity words except the target entity words in the target text, and the associated entity words comprise description words related to the identity characteristics of the target user.
In step S201 and step S202, the target text is a text that includes an entity word of the user name type, and that entity word is the user name of the target user. After obtaining the target text, the computer device may therefore use the entity word of the user name type in the target text as the target entity word. In a specific implementation, after obtaining the target text the computer device may determine a text fragment corresponding to the target text. To do so, the computer device may first invoke a word segmentation service and an entity word recognition service to obtain a word segmentation result (that is, a word segmentation set) and an entity word recognition result for the target text. In one embodiment, the computer device may perform word segmentation on the target text using a Natural Language Processing (NLP) algorithm to obtain the word segmentation set corresponding to the target text, and may perform entity word recognition on the target text using a Named Entity Recognition (NER) algorithm to obtain the entity word recognition result corresponding to the target text. NLP is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, that is, the language people use every day, and is therefore closely related to linguistics.
Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering robots, knowledge graphs, and the like.
In one embodiment, if the target text is "Xiaoxi commentary game video Honor of Kings highlight segment", the word segmentation set obtained by the computer device after performing word segmentation on the target text may be "Xiaoxi/commentary/game/video/Honor of Kings/highlight segment", and the entity word recognition result obtained after performing entity word recognition on the target text may be "Xiaoxi_username/Honor of Kings_IP-game name". After obtaining the word segmentation set and the entity word recognition result of the target text, the computer device may further filter the word segmentation set so that it includes only the segments that do not constitute entity words, with labels, stop words and the like filtered out. For the target text above, the filtered word segmentation set may be "commentary/game/video/highlight segment", and the filtered word segmentation set is equivalent to the set of original words.
After obtaining the filtered word segmentation set, the computer device may merge it with the entity word recognition result to obtain the text fragment corresponding to the target text. Segments and entity words are not distinguished within the text fragment; in the embodiment of the invention, the words in the fragment are collectively referred to as entity words. Based on the filtered word segmentation set "commentary/game/video/highlight segment" obtained from the word segmentation of the target text and the entity word recognition result "Xiaoxi_username/Honor of Kings_IP-game name" obtained from the entity word recognition, the computer device finally obtains the text fragment of the target text, namely "Xiaoxi_username/Honor of Kings_IP-game name/commentary/game/video/highlight segment". Here, IP is a proper noun category returned by the NER service and covers game names, movie names, book names, music names and the like. When the NER service is called, it may identify that a word is an IP name, or may further identify the fine-grained category of the IP; for example, when the input text is "Xiaoxi commentary game video Honor of Kings", calling the NER service can identify that "Honor of Kings" belongs to the IP category and that its fine-grained category is game. After the computer device determines the text fragment corresponding to the target text, since the entity words of the user name type are marked in the text fragment, the computer device may select the entity word marked as the user name type (for example, Xiaoxi) as the target entity word. It can be understood that the target entity word refers to the target user named Xiaoxi.
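The filtering and merging steps described above can be sketched as follows. This is an illustrative sketch only: the segmentation and NER results are supplied by hand rather than by real services, the helper names `build_fragment` and `pick_target` are hypothetical, and the example strings use "Xiaoxi" and "Honor of Kings" as an assumed normalized rendering of the running example.

```python
from typing import Dict, List, Optional, Set, Tuple

Fragment = List[Tuple[str, Optional[str]]]  # (word, entity type or None)


def build_fragment(segments: List[str], entities: Dict[str, str],
                   stopwords: Set[str]) -> Fragment:
    """Filter the segmentation set, then merge it with the NER result."""
    # Keep only segments that are neither recognized entity words nor stop words.
    filtered = [w for w in segments if w not in entities and w not in stopwords]
    # Typed entity words first, followed by the remaining untyped words.
    return [(w, t) for w, t in entities.items()] + [(w, None) for w in filtered]


def pick_target(fragment: Fragment, wanted_type: str = "username") -> Optional[str]:
    """Return the first word marked with the wanted entity type, if any."""
    for word, entity_type in fragment:
        if entity_type == wanted_type:
            return word
    return None


# Hand-written stand-ins for the segmentation service and NER service outputs.
segments = ["Xiaoxi", "commentary", "game", "video", "Honor of Kings", "highlight segment"]
entities = {"Xiaoxi": "username", "Honor of Kings": "IP-game name"}

fragment = build_fragment(segments, entities, stopwords=set())
target = pick_target(fragment)
print(fragment)
print(target)
```

Keeping the entity type next to each word makes the later selection of the target entity word a simple lookup on the "username" type, and every other word in the fragment remains available as a candidate associated entity word.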
After obtaining the text fragment corresponding to the target text, the computer device may further determine the associated entity word of the target entity word based on that text fragment, where the associated entity word may be any one or more entity words in the text fragment other than the target entity word. For example, if the text fragment of the target text is "Xiaoxi_username/Honor of Kings_IP-game name/commentary/game/video/highlight segment", then after acquiring the target entity word "Xiaoxi" from the target text, the computer device uses any one or more of the other entity words in the text fragment (such as "Honor of Kings" or "commentary") as associated entity words of the target entity word. It can be understood that the associated entity words are descriptors related to the identity characteristics of the target user, or may be any other entity words that have no specific meaning and only have a context relation with the target entity word. In the embodiment of the present invention, the computer device may also acquire the target entity word in other manners; for example, it may perform only NER processing on the target text, use the entity word marked as the user name category in the processing result as the target entity word, and then select any word other than the target entity word from the target text as the associated entity word.
After obtaining the target entity word and the associated entity word of the target entity word from the target text, the computer device may invoke the target network model to perform recognition processing on the target entity word and the associated entity word, so as to obtain a first word vector corresponding to the target entity word and a second word vector corresponding to the associated entity word; that is, the computer device proceeds to step S203.
S203, identifying the target entity words to obtain first word vectors of the target entity words, and identifying the associated entity words to obtain second word vectors corresponding to the associated entity words.
S204, performing label prediction processing on the target entity word by combining the first word vector and the second word vector to obtain a prediction label of the target entity word, wherein the prediction label is used for indicating whether the user identity of the target user is the target identity.
In steps S203 and S204, when performing recognition processing on the target entity word and the associated entity word to obtain the first word vector and the second word vector, the computer device may first process the two entity words to obtain a vector representation of each character in the target entity word and a vector representation of each character in the associated entity word. In a specific implementation, the computer device may obtain the vector representation of each character by calling the word vector generation network in the target network model. After obtaining the vector representation of each character in the target entity word, the computer device may directly splice these vector representations to obtain the first word vector of the target entity word. After obtaining the vector representation of each character in the associated entity word, the computer device may further call the semantic coding network to semantically encode the associated entity word and obtain the second word vector of the associated entity word from that encoding. Semantic encoding by the semantic coding network thus introduces the semantics of the associated entity word into the corresponding second word vector.
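The two ways of building word vectors described above (splicing for the target entity word, semantic encoding for the associated entity word) can be contrasted in a short sketch. Everything concrete here is an assumption: the character table is filled with random values in place of a trained word vector generation network, `DIM` and `MAX_CHARS` are hypothetical sizes, zero padding is added so the spliced vector has a fixed length, and mean pooling is used as a plain stand-in for the LSTM or Bi-LSTM semantic coding network.

```python
import random
from typing import Dict, List

DIM = 4        # hypothetical per-character vector length
MAX_CHARS = 3  # hypothetical cap so that the spliced vector has a fixed size

_table: Dict[str, List[float]] = {}
_rng = random.Random(0)  # fixed seed; stands in for trained embedding weights


def char_vector(ch: str) -> List[float]:
    """Look up (or lazily create) the fixed-length vector for one character."""
    if ch not in _table:
        _table[ch] = [_rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    return _table[ch]


def first_word_vector(word: str) -> List[float]:
    """Splice the character vectors of the target entity word, then zero-pad."""
    vec = [v for ch in word[:MAX_CHARS] for v in char_vector(ch)]
    return vec + [0.0] * (DIM * MAX_CHARS - len(vec))


def second_word_vector(word: str) -> List[float]:
    """Mean-pool the character vectors (stand-in for the semantic coding network)."""
    vectors = [char_vector(ch) for ch in word]  # word is assumed non-empty
    return [sum(col) / len(vectors) for col in zip(*vectors)]


first = first_word_vector("ab")       # hypothetical target entity word
second = second_word_vector("game")   # hypothetical associated entity word
features = first + second             # combined input for label prediction
print(len(first), len(second), len(features))
```

The spliced vector keeps each character's vector in its own slot, while the encoded vector mixes information across characters, which is the distinction the FIG. 1b structure draws between the target entity word and the associated entity word.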
After obtaining the first word vector corresponding to the target entity word and the second word vector corresponding to the associated entity word, the computer device may input the first word vector and the second word vector to an output layer of the target network model. The output layer is a softmax layer (softmax being the normalized exponential function used in multi-class logistic regression) that includes the identity discrimination network. It can be understood that, after receiving the first word vector and the second word vector, the output layer may perform label prediction processing on the target entity word to obtain the prediction label of the target entity word. In a specific implementation, when performing label prediction based on the first word vector and the second word vector, the output layer first generates two candidate labels for the target entity word and a confidence for each candidate label: a first candidate label indicating that the target user has the target identity, and a second candidate label indicating that the target user does not have the target identity. The computer device may then determine the prediction label of the target entity word based on the candidate labels and their confidences. For example, if the confidence of the first candidate label is a, the confidence of the second candidate label is b, and a > b, the computer device may determine that the prediction label of the target entity word is the first candidate label.
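The selection between the two candidate labels can be sketched as below. The input `logits` stands for the two raw scores produced by the identity discrimination network, which is not modeled here; the label names and the function `predict_label` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax turning raw scores into confidences."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float]) -> Tuple[str, Dict[str, float]]:
    """Return the candidate label with the higher confidence, plus both confidences."""
    a, b = softmax(logits)  # a: has the target identity, b: does not
    confidences = {"is_target_identity": a, "not_target_identity": b}
    label = "is_target_identity" if a > b else "not_target_identity"
    return label, confidences


# Made-up raw scores for the two candidate labels.
label, confidences = predict_label([2.0, 0.5])
print(label, confidences)
```

Because softmax confidences sum to one, comparing a with b is the same as checking whether a exceeds 0.5; a stricter threshold could be used if fewer false positives are wanted, though the text does not describe one.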
In an embodiment, the target network model may be a trained network model. Based on the trained network model, label prediction can be performed on a user name according to the user name and its corresponding context, so as to determine the user identity of the user corresponding to the user name. In this way, the user identity corresponding to an unknown user name can be mined, which promotes the expansion of data for users who have the target identity. When the target network model is an untrained network model, the computer device may, after obtaining the prediction label of the target entity word, determine the real label of the target entity word, so that the target network model can be optimized and trained based on the label difference between the prediction label and the real label to obtain a trained target network model. The trained target network model can then be used to predict labels for target entity words whose real labels are unknown, thereby improving the recall of users who have the target identity.
In the embodiment of the present invention, after obtaining the target text, the computer device may obtain the user name of the target user from the target text and use it as the target entity word. After determining the target entity word, the computer device may further determine the associated entity words of the target entity word from the target text, so that by obtaining the target entity word and the associated entity words from the target text, the computer device obtains entity words with non-coherent (discrete) semantics. The computer device may then perform recognition processing on the target entity word to obtain a first word vector, perform recognition processing on the associated entity word to obtain a second word vector, and predict the prediction label of the target entity word according to the first word vector and the second word vector, so as to determine from the prediction label whether the target user indicated by the target entity word has the target identity. The computer device can therefore determine the label corresponding to the target entity word (namely, the refined entity type corresponding to the target entity word) based on non-coherent semantics. Using non-coherent entity words for label prediction of the target entity word can effectively improve the convenience, and thus the efficiency, with which the computer device predicts entity word labels.
Fig. 3 is a schematic flow chart of a data processing method according to an embodiment of the present invention. The method may also be executed by the computer device, and this embodiment mainly explains the training process of the target network model in detail. As shown in fig. 3, the method may include:
S301, acquiring a target text, and acquiring a target entity word from the target text, wherein the target entity word is a user name of a target user.
S302, determining the associated entity words of the target entity words from the target text, wherein the associated entity words are any one or more other entity words except the target entity words in the target text, and the associated entity words comprise descriptors related to the identity characteristics of the target user.
In one embodiment, the target text may be any training text selected from a training text set; when the target text is a training text in the training text set, model training of the target network model can be implemented. Each training text in the training text set includes the target entity word. That is, when acquiring the target entity word from the target text, the computer device may first acquire the training text set and then acquire the user name of the target user from the training text set as the target entity word. The training text set is a text set in which every training text includes the same user name, and that user name is the user name of the target user. In one embodiment, the computer device may obtain the training text set from a search log of users' historical searches. In a specific implementation, the computer device may first obtain the search log and determine a text fragment (segments) of each search text in the search log, where the text fragment includes one or more entity words, and at least one of those entity words is labeled with an entity word category. Then, according to the entity words labeled with entity word categories in the text fragments of the search texts, the computer device selects from the text fragments those in which the entity word category is the user name category and the user name is the same; the text fragments with the same user name form the training text set.
In one embodiment, the search log is the related data generated when one or more users perform data searches over the internet, as collected by the computer device within a certain time range, and the search log includes at least the search texts. A search text may be, for example, "Xiao A commentary game video", "Xiao A game commentary live broadcast", or "Xiao B live broadcast sale". The time range over which the computer device collects the search texts to obtain the search log may be, for example, one day or one hour, which is not limited in the embodiment of the present invention. After obtaining the search log, the computer device can determine the text fragment of each search text in it. When determining the text fragment of any search text, the computer device may first perform word segmentation processing and entity word recognition processing on the search text to obtain a word segmentation set and an entity word recognition result for that search text, where the entity word recognition result includes one or more entity words and the category of each entity word. The computer device may then filter the word segmentation set and use the filtered word segmentation set together with the entity word recognition result to form the text fragment of the search text. For the way in which the computer device obtains the text fragment of a search text, reference may be made to the way of obtaining word segments in the above embodiments, which is not described herein again.
If the search log obtained by the computer device is denoted as a set Q and each search text in the search log is denoted as a search query, then for any search query query_i in the search log Q, the text fragment of query_i can be obtained using the text fragment generation manner described above. The text fragment of query_i may be denoted as query_i_segments, with query_i_segments = {t_1, t_2, t_3, …, t_n}, where t_k represents the k-th word segment in the text fragment of query_i. After the computer device obtains the text fragment of each search text in the set Q, if no entity word of the user name category exists in the text fragment query_i_segments, the computer device does not process that search text further and may directly discard it. Otherwise, if an entity word of the user name category exists in the text fragment query_i_segments, for example t_j is an entity word of the user name category (i.e., a user name), the computer device may screen out from the set Q the queries that contain an entity word of the user name category, and add the text fragments of those screened queries that contain the same user name to the training text set, so that the same user name appears in every text fragment of the training text set.
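The screening step above can be sketched as follows. This is only an illustration under stated assumptions: each text fragment is represented as a list of (word segment, category) pairs, the category tag `"name"` stands for the user name category, and the function name `build_training_sets` is hypothetical. How a fragment containing several different user names is handled is not stated in the text; here it is simply added to the set of each user name.

```python
from collections import defaultdict

USERNAME = "name"  # hypothetical tag for the user name category


def build_training_sets(segmented_log):
    """Group text fragments by the user name they contain.

    segmented_log: list of fragments, each a list of (token, category) pairs,
    where category is None for tokens without an entity word category.
    Returns {user_name: [fragment, ...]}; fragments without any entity word
    of the user name category are discarded.
    """
    training_sets = defaultdict(list)
    for fragment in segmented_log:
        user_names = [tok for tok, cat in fragment if cat == USERNAME]
        if not user_names:
            continue  # no user name in this query: discard it
        for name in dict.fromkeys(user_names):  # de-duplicate, keep order
            training_sets[name].append(fragment)
    return dict(training_sets)
```

Each value of the returned dictionary plays the role of one training text set, since all of its fragments share the same user name.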
Based on the determined training text set, the computer device may use the user name appearing in each text fragment in the training text set as the target entity word, and determine the associated entity words of the target entity word. In a specific implementation, if t_j in a text fragment query_i_segments included in the training text set is the target entity word, the context set of the target entity word in query_i_segments may be denoted as person_query_i_context, where person_query_i_context = query_i_segments - {t_j}; that is, the context set of the target entity word in the text fragment is the difference set between the text fragment query_i_segments and the target entity word t_j. The computer device can then compute this difference set for every text fragment in the training text set. After obtaining the context of the target entity word at the granularity of each text fragment, the computer device may select some or all of the entity words from the obtained contexts as the associated entity words of the target entity word and record the number of occurrences of each context, thereby obtaining all contexts (namely, associated entity words) of the target entity word over the whole training text set and the number of occurrences of each context.
For example, if the training text set includes the text fragments corresponding to 3 queries, they may be as shown in (1) to (3) below:
(1) Xiaoxi commentary game video: Xiaoxi_name / commentary / game / video;
(2) Xiaoxi commentary highlights: Xiaoxi_name / commentary / highlights;
(3) Xiaoxi Honor of Kings commentary: Xiaoxi_name / Honor of Kings_IP-game / commentary;
Then the target entity word obtained by the computer device is Xiaoxi, and the associated entity words of the target entity word and their numbers of occurrences are: Honor of Kings | 1 time / commentary | 3 times / game | 1 time / video | 1 time / highlights | 1 time. Based on the counted number of occurrences of each associated entity word, the computer device can determine the importance score of each associated entity word, and can then determine the real label of the target entity word based on the importance scores, so that the target network model is optimized and trained using the real label.
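The context statistics for the three example queries can be reproduced with a short sketch. It assumes each text fragment is already a list of word segments, writes the user name and game title in romanized or translated form ("Xiaoxi", "Honor of Kings"), and uses the hypothetical function name `context_counts`; the set difference mirrors person_query_i_context = query_i_segments - {t_j}.

```python
from collections import Counter


def context_counts(fragments, target):
    """Count how often each context word occurs across all text fragments.

    For each fragment the context is the set difference between the
    fragment and the target entity word.
    """
    counts = Counter()
    for fragment in fragments:
        counts.update(set(fragment) - {target})
    return counts


# The three example fragments, with category suffixes omitted.
fragments = [
    ["Xiaoxi", "commentary", "game", "video"],
    ["Xiaoxi", "commentary", "highlights"],
    ["Xiaoxi", "Honor of Kings", "commentary"],
]
counts = context_counts(fragments, "Xiaoxi")
```

Here `counts["commentary"]` is 3 and every other context word has a count of 1, matching the figures in the example.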
S303, calling the target network model to identify the target entity word to obtain a first word vector of the target entity word.
S304, calling a target network model to identify and process the associated entity words corresponding to the target entity words in each training text to obtain second word vectors corresponding to each associated entity word.
S305, calling the target network model, and performing label prediction processing on the target entity word by combining the first word vector and the second word vector to obtain a prediction label of the target entity word, wherein the prediction label is used for indicating whether the user identity of the target user is the target identity.
After obtaining the target entity word, the computer device may call the target network model to perform recognition processing on the target entity word to obtain the first word vector of the target entity word, and call the target network model to perform recognition processing on the associated entity words corresponding to the target entity word in each training text to obtain the second word vector corresponding to each associated entity word. After obtaining the prediction label of the target entity word based on the first word vector and the second word vectors, the computer device may optimize and train the target network model in combination with the real label of the target entity word. In one embodiment, the computer device may determine the real label of the target entity word based on the importance score corresponding to each associated entity word, where the importance score indicates how accurately the corresponding associated entity word describes the identity characteristics of the target user. Suppose the number of associated entity words determined by the computer device is N, where N ≥ 1 and N is an integer. The computer device may obtain the number of occurrences of any one of the N associated entity words and determine the importance score of that associated entity word according to the number of occurrences. In one embodiment, the computer device may use the number of occurrences of an associated entity word directly as its importance score; alternatively, the computer device may first normalize the number of occurrences and then use the normalized value as the importance score of that associated entity word.
For example, if the associated entity words and the corresponding numbers of occurrences determined by the computer device are: Honor of Kings | 1 time / commentary | 3 times / game | 1 time / video | 1 time / highlights | 1 time, then the computer device may determine that the importance score of the associated entity word "Honor of Kings" is 1, the importance score of the associated entity word "commentary" is 3, and so on; or, after normalization by the total of 7 occurrences, that the importance score of "Honor of Kings" is 1/7, the importance score of "commentary" is 3/7, and so on.
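Both scoring options can be written down in a few lines. The sketch below assumes that normalization means dividing each count by the total number of occurrences, which is what the 1/7 and 3/7 values in the example suggest; the function name `importance_scores` is hypothetical, and exact fractions are used so that the example values can be compared directly.

```python
from fractions import Fraction


def importance_scores(counts, normalize=False):
    """Map each associated entity word to its importance score.

    normalize=False: the raw number of occurrences is the score.
    normalize=True: each count is divided by the total number of occurrences
    (an assumed normalization, consistent with the 1/7 and 3/7 example).
    """
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    return {word: Fraction(n, total) for word, n in counts.items()}


example_counts = {
    "Honor of Kings": 1,
    "commentary": 3,
    "game": 1,
    "video": 1,
    "highlights": 1,
}
raw = importance_scores(example_counts)
normalized = importance_scores(example_counts, normalize=True)
```

Either variant yields the same ranking of associated entity words; normalization only rescales the scores so that they sum to 1.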
After determining the importance score of each associated entity word, the computer device may determine the real label of the target entity word based on the importance scores. The real label indicates whether the target entity word (or the target text) is a positive example or a negative example: a positive example indicates that the user identity of the target user indicated by the target entity word is the target identity, and a negative example indicates that it is not. When determining the real label, the computer device may first arrange the N associated entity words in descending order of their importance scores and then select L associated entity words in sequence, starting from the first position, where 1 ≤ L ≤ N and L is an integer. If the selected L associated entity words include the reference entity word, the real label of the target entity word is determined to be the first label, which indicates that the user identity of the target user is the target identity. Typical values of N are 10, 15, and so on, and typical values of L are 5, 8, and so on.
In one embodiment, the reference entity words are preset according to the target identity to be determined and are descriptors related to the identity characteristics of the target identity. For example, the target identity may be a game commentary identity, and the reference entity words preset for the game commentary identity include any one or more of the following: commentary, entity words of the game name category, and the like. A user with the game commentary identity is a user who helps listeners and viewers follow the progress of a game, describes what happens in the game, and explains the commentator's own views; on video vertical websites, a large number of commentators upload their own game commentary videos, and a large number of fans search for the game videos of particular game commentators. The target identity may also be a live broadcast identity, in which case the reference entity words preset for the live broadcast identity include any one or more of the following: live broadcast, entity words of the trade name category, and the like.
In one embodiment, associated entity words with the same importance score may be arranged in random order. For example, if the associated entity words and corresponding numbers of occurrences determined by the computer device are: Honor of Kings | 1 time / commentary | 3 times / game | 2 times / video | 1 time / highlights | 1 time / live broadcast | 1 time, then the computer device may arrange the associated entity words in descending order of importance score, and one possible order is: commentary, game, Honor of Kings, video, highlights, live broadcast. The computer device may then select L associated entity words in sequence; assuming L is 5, the selected L associated entity words are: commentary, game, Honor of Kings, video, highlights. If the reference entity words determined by the computer device are "commentary" and entity words of the game name (or game category), the computer device may determine that the selected L associated entity words include the reference entity word, determine that the target entity word is a positive example, and determine that the real label of the target entity word is the first label. If the selected L associated entity words do not include the reference entity word, that is, "commentary" does not appear and no game name appears, the computer device may determine that the target entity word is a negative example and determine that the real label of the target entity word is the second label.
In one embodiment, if the L associated entity words selected by the computer device do not include the reference entity word, the real label of the target entity word may be determined to be the second label, which indicates that the user identity of the target user is not the target identity. Alternatively, if the L selected associated entity words do not include the reference entity word, the computer device may further select J associated entity words in sequence, starting from the (L+1)-th associated entity word, and when the selected L+J associated entity words do not include the reference entity word, determine that the target entity word is a negative example and that its real label is the second label, where 1 ≤ J ≤ N and J is an integer. For example, if the 5 associated entity words selected by the computer device do not include the reference entity word, the real label of the target entity word may be directly determined to be the second label; or the computer device may select a further J (for example, 5) associated entity words, so that the target entity word is determined to be a negative example, with the second label as its real label, only when none of the 10 (L+J) selected associated entity words is the reference entity word. That is, if the target identity to be determined is the game commentator, the computer device may first determine, based on the importance scores, the associated entity words ranked in the TOP L (e.g., TOP 5), and judge whether "commentary" appears among the TOP 5 associated entity words and whether they include a game category entity word; when the judgment is yes, the target entity word is determined to be a positive example and its real label is determined to be the first label.
If, among the associated entity words whose importance scores rank in the TOP L+J (e.g., TOP 10), "commentary" does not appear and no game category entity word is included, the target entity word is determined to be a negative example and its real label is determined to be the second label.
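The labeling rule can be sketched as below, under several assumptions. The reference entity words are given as a plain set, and "includes the reference entity word" is read as "at least one reference word is among the selected words". Ties are broken by input order rather than randomly. The text does not say what happens when a reference word first appears between positions L+1 and L+J, so this sketch returns `None` (no label) in that case. The names `true_label`, `POSITIVE` and `NEGATIVE` are hypothetical.

```python
POSITIVE = "first_label"   # target user has the target identity
NEGATIVE = "second_label"  # target user does not have the target identity


def true_label(scores, reference_words, top_l=5, extra_j=0):
    """Derive the real label from importance scores.

    scores: {associated_entity_word: importance_score}
    reference_words: set of preset reference entity words
    top_l: L, the number of top-ranked words checked for a positive example
    extra_j: J, the number of further words checked before declaring a
             negative example (0 reproduces the simpler variant)
    """
    # Descending by score; Python's sort is stable, so ties keep input order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    if any(word in reference_words for word in ranked[:top_l]):
        return POSITIVE
    if any(word in reference_words for word in ranked[top_l:top_l + extra_j]):
        return None  # assumption: this case is left unlabeled
    return NEGATIVE
```

With the six-word example above and `reference_words={"commentary"}`, the function returns the first label, because "commentary" has the highest score and therefore falls inside the TOP 5.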
After determining the real label of the target entity word, the computer device can determine the label difference between the prediction label and the real label of the target entity word, and adjust the model parameters of the target network model according to the label difference. When the label difference between the prediction label and the real label is smaller than a preset difference, training of the target network model is stopped and the trained target network model is obtained. In an embodiment, the target network model is a TOP_n Context Entity Typing Model (an entity word processing model). As shown in fig. 4, after acquiring the target entity word and its associated entity words, the computer device inputs the target entity word into the target network model to obtain the first word vector of the target entity word, and inputs the associated entity words into the target network model to obtain the second word vectors of the associated entity words. The associated entity words input into the target network model are those whose importance scores rank in the top n (e.g., the top 10 or the top 15). After receiving the input associated entity words, the target network model may call a corresponding semantic coding network for each associated entity word to obtain the second word vector of that associated entity word. The target network model can then obtain the prediction label of the target entity word based on the first word vector and the second word vectors, and when the prediction label and the real label of the target entity word are not identical, the model parameters of the target network model are adjusted to optimize the target network model.
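The stopping rule (adjust parameters until the label difference falls below a preset difference) can be illustrated with a deliberately small stand-in. The sketch below is not the TOP_n Context Entity Typing Model: it assumes a plain logistic regression over toy feature vectors, uses the mean cross-entropy as the "label difference", and applies batch gradient descent. The function name `train`, the learning rate, and the threshold value are all assumptions made for illustration.

```python
import math


def train(samples, preset_difference=0.1, learning_rate=0.5, max_epochs=10000):
    """Train a toy logistic-regression classifier until the label difference
    (mean cross-entropy, an assumed measure) drops below preset_difference.

    samples: list of (feature_vector, real_label) with real_label in {0, 1}.
    Returns (weights, bias, final_difference, epochs_run).
    """
    dim = len(samples[0][0])
    n = len(samples)
    weights, bias = [0.0] * dim, 0.0
    loss = float("inf")
    for epoch in range(max_epochs):
        loss, grad_w, grad_b = 0.0, [0.0] * dim, 0.0
        for features, real_label in samples:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            p = 1.0 / (1.0 + math.exp(-z))          # predicted confidence
            p = min(max(p, 1e-12), 1.0 - 1e-12)     # keep log() finite
            loss -= real_label * math.log(p) + (1 - real_label) * math.log(1.0 - p)
            for i, x in enumerate(features):
                grad_w[i] += (p - real_label) * x
            grad_b += p - real_label
        loss /= n
        if loss < preset_difference:
            return weights, bias, loss, epoch       # stop training
        # Adjust the model parameters according to the label difference.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias, loss, max_epochs


toy_samples = [
    ([1.0, 0.0], 1), ([0.9, 0.1], 1),   # positive examples
    ([0.0, 1.0], 0), ([0.1, 0.9], 0),   # negative examples
]
weights, bias, difference, epochs = train(toy_samples)
```

In a real system the feature vectors would be replaced by the first and second word vectors, and the classifier by the network shown in fig. 4; only the stop condition carries over unchanged.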
Practice shows that the F1 score of the trained target network model can reach 90% on the test set. The trained target network model is used to make predictions for user names not covered by the rules, together with their corresponding contexts, and the prediction results are finally subjected to manual sampling inspection.
After obtaining the prediction label of the target entity word, if the prediction label indicates that the user identity of the target user is the target identity, the computer device may add the target entity word to a search vocabulary, where the user identity of the user corresponding to each entity word in the search vocabulary is the target identity. Further, the computer device may add the search vocabulary to an intention recognition module, so that the intention recognition module performs intention search and search feedback based on the search vocabulary. In one embodiment, the computer device may obtain an intention search word from the user side and obtain a search user name from the intention search word, where the intention search word is an entity word that does not point to specific search content. The computer device may then look up whether the search vocabulary includes the search user name and, when it does, use the data of the search user corresponding to the search user name under the target identity as search feedback data and output the search feedback data at the user side. In a specific implementation, if the target identity is the game commentary identity, the search vocabulary obtained by the computer device is a game commentator vocabulary, which records the user names of users whose user identity is the game commentary identity. The game commentator vocabulary can be used to expand the game commentator knowledge base of an existing NER (named entity recognition) service and thereby increase the recall with which the NER service recognizes game commentators.
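The vocabulary update and lookup described above can be sketched as follows. The sketch assumes the intention search word has already been split into tokens, that the search vocabulary is a plain set of user names, and that feedback data is held in a dictionary keyed by user name; the names `add_if_target_identity`, `intent_search` and the label string `"first_label"` are hypothetical.

```python
def add_if_target_identity(search_vocabulary, entity_word, prediction_label):
    """Add the target entity word to the search vocabulary when the
    prediction label indicates the target identity."""
    if prediction_label == "first_label":
        search_vocabulary.add(entity_word)


def intent_search(query_tokens, search_vocabulary, data_by_user):
    """Look for a search user name in the query and return its feedback data.

    Returns (search_user_name, feedback_data), or (None, []) when no token
    of the query is in the search vocabulary.
    """
    for token in query_tokens:
        if token in search_vocabulary:
            return token, data_by_user.get(token, [])
    return None, []
```

For example, once "Xiaobai" has been added to the vocabulary, a query tokenized as `["Xiaobai", "commentary"]` returns the data stored for "Xiaobai", while a query with no known user name returns no feedback data.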
In another implementation, the mined game commentator vocabulary may also be applied to the intention recognition module. An intention-class query refers to a keyword type for a class of video content rather than a search word for specific content; for example, query = "Xiaobai commentary" is a short video intention-class query. The intention recognition module recognizes the search intention of a search word and its key slot information; for example, for query = "Xiaobai commentary", the intention recognition module may output an intention result in which the game commentator slot is "Xiaobai", so that the videos related to Xiaobai as a game commentator are output and fed back to the user side, as shown in fig. 5a. That is, the mined game commentators can be applied to expand the recall of intention recognition. In one embodiment, the intention search word may be sent to a server by a search user through a terminal device. As shown in fig. 5b, when performing a data search based on the intention search word, the search user may input the intention search word through the terminal device 50, and after obtaining the intention search word, the terminal device 50 may send it to the server 51. The server 51 may then obtain a search user name based on the intention search word and, when determining that the search vocabulary includes the search user name, send the data of the search user corresponding to the search user name under the target identity to the terminal device 50 as search feedback data. The terminal device 50 may display the search feedback data to the search user, thereby completing the intention search process. If the input intention search word is the above-mentioned "Xiaobai commentary", the search feedback data finally displayed on the terminal device 50 may be as in the example intention area interface shown in fig. 5a.
When the search user inputs the intention search word through the terminal device 50, the intention search word may be input into a target application program; that is, the intention search word acquired by the terminal device 50 is entered by the search user in the target application program. The target application program may be a search engine or another application program that supports a search function, and it may be an independent client in the terminal device or an applet within a client, which is not limited in the embodiment of the present invention. It should be noted that, after obtaining the intention search word, the terminal device may also determine the search feedback data directly and output and display it.
In the embodiment of the present invention, the computer device may first preprocess the search log based on NER technology and word segmentation technology to obtain the target entity word, and then determine the real label of the target entity word according to its context. The computer device may further perform recognition processing on the discrete context (i.e., the associated entity words of the target entity word) and on the target entity word to obtain the first word vector of the target entity word and the second word vector of each associated entity word, and call the target network model to obtain the prediction label of the target entity word based on the first word vector and the second word vectors, so that the target network model is trained based on the prediction label and the real label. Finally, the trained target network model is used to make predictions on search logs not covered by the rules, which expands the recognition recall of commentators and generates the user vocabulary corresponding to the target identity, so that the computer device can perform subsequent intention searches based on the user vocabulary and improve search accuracy.
Based on the description of the above data processing method embodiment, an embodiment of the present invention further provides a data processing apparatus, which may be a computer program (including program code) running in the computer device. The data processing apparatus may be used to execute the data processing method shown in fig. 2 and fig. 3. Referring to fig. 6, the data processing apparatus includes: an obtaining unit 601, a determining unit 602 and a processing unit 603.
An obtaining unit 601, configured to obtain a target text and obtain a target entity word from the target text, where the target entity word is a user name of a target user;
a determining unit 602, configured to determine, from the target text, an associated entity word of the target entity word, where the associated entity word is any one or more other entity words except the target entity word in the target text, and the associated entity word includes a descriptor related to an identity feature of the target user;
the processing unit 603 is configured to perform recognition processing on the target entity word to obtain a first word vector of the target entity word, and perform recognition processing on the associated entity word to obtain a second word vector of the associated entity word;
the processing unit 603 is further configured to perform label prediction processing on the target entity word in combination with the first word vector and the second word vector, so as to obtain a prediction label of the target entity word, where the prediction label is used to indicate whether the user identity of the target user is a target identity.
In one embodiment, if the target text is any one of a set of training texts, each of the set of training texts includes the target entity word; the processing unit 603 is specifically configured to:
and identifying the associated entity words corresponding to the target entity words included in each training text to obtain a second word vector corresponding to each associated entity word.
In one embodiment, the first word vector, the second word vector, and the predictive label are obtained by calling a target network model, and the target network model is a neural network for user identity discrimination; the device further comprises: an adjustment unit 604.
The obtaining unit 601 is further configured to obtain an importance score corresponding to each associated entity word;
the determining unit 602 is further configured to determine a real label of the target entity word according to the importance scores; the importance degree score is used for indicating the accuracy degree of describing the identity characteristics of the target user by adopting corresponding associated entity words;
an adjusting unit 604, configured to determine, according to a predicted tag and a real tag of the target entity word, a tag difference between the predicted tag and the real tag, and adjust a model parameter of the target network model according to the tag difference;
the adjusting unit 604 is further configured to stop the training of the target network model when the tag difference is smaller than a preset difference, so as to obtain a trained target network model.
In one embodiment, the number of associated entity words is N, where N is an integer and N ≥ 1; the determining unit 602 is specifically configured to:
sort the N associated entity words in descending order of their importance scores, and select L associated entity words in order starting from the first-ranked one, where L is an integer and 1 ≤ L ≤ N;
if the selected L associated entity words include a reference entity word, determine that the real label of the target entity word is a first label, where the first label indicates that the user identity of the target user is the target identity, and the reference entity words are descriptors preset for the target identity that relate to the identity characteristics of that identity.
In an embodiment, the determining unit 602 is further configured to determine, if the selected L associated entity words do not include the reference entity word, that the real label of the target entity word is a second label, where the second label indicates that the user identity of the target user is not the target identity; or,
the determining unit 602 is further configured to, if the selected L associated entity words do not include the reference entity word, select a further J associated entity words in order starting from the (L+1)-th associated entity word, and determine that the real label of the target entity word is the second label when the selected L+J associated entity words do not include the reference entity word, where J is an integer and 1 ≤ J ≤ N.
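The real-label rule in the two embodiments above can be sketched as a short function. The names `determine_real_label`, `top_l`, and `extra_j`, and the encoding of the first and second labels as 1 and 0, are hypothetical. The source states only that the second label is assigned when the L+J selected words contain no reference entity word; returning the first label when the extra J words do contain one is an assumption made here to complete the sketch.

```python
FIRST_LABEL = 1   # user identity is the target identity (encoding is an assumption)
SECOND_LABEL = 0  # user identity is not the target identity


def determine_real_label(importance_scores, reference_words, top_l, extra_j=0):
    """Derive the real label of a target entity word from its associated words.

    importance_scores: dict mapping each associated entity word to its score.
    reference_words:   set of reference entity words preset for the target identity.
    top_l:             L, the number of top-ranked words inspected first.
    extra_j:           J, the number of further words inspected if the top L fail.
    """
    # Sort the associated entity words from highest to lowest importance score.
    ranked = sorted(importance_scores, key=importance_scores.get, reverse=True)

    # First label if any of the top L words is a reference entity word.
    if any(word in reference_words for word in ranked[:top_l]):
        return FIRST_LABEL

    # Optional fallback: inspect the next J words, starting from the (L+1)-th.
    # Assumption: a hit here also yields the first label.
    if extra_j and any(word in reference_words
                       for word in ranked[top_l:top_l + extra_j]):
        return FIRST_LABEL

    return SECOND_LABEL
```

For example, with scores `{"commentary": 5, "video": 3, "funny": 2}` and reference words `{"commentary"}`, inspecting the top two words is enough to assign the first label.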
In one embodiment, the number of associated entity words is N, where N is an integer and N ≥ 1; the obtaining unit 601 is specifically configured to:
obtain the number of occurrences of any associated entity word among the N associated entity words, and determine the importance score of that associated entity word according to the number of occurrences;
where the importance score of an associated entity word is either its number of occurrences among the N associated entity words, or the value obtained by normalizing that number of occurrences.
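A minimal sketch of the two scoring options described above, raw occurrence count or normalized count. The function name `importance_scores` is hypothetical, and dividing by the total number of occurrences is an assumed normalization, since the source does not say which normalization is used.

```python
from collections import Counter


def importance_scores(associated_words, normalize=False):
    """Score each associated entity word by how often it occurs.

    associated_words: list of the N associated entity words (with repeats).
    normalize:        if True, divide each count by the total count
                      (an assumed normalization scheme).
    """
    counts = Counter(associated_words)
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```

Calling `importance_scores(["commentary", "video", "commentary", "funny"])` gives a count of 2 for "commentary" and 1 for each of the other words; with `normalize=True` the same words score 0.5, 0.25, and 0.25.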
In an embodiment, the processing unit 603 is specifically configured to:
acquire a search log, and determine a text fragment for each search text in the search log, where the text fragment includes one or more entity words, at least one of which is labeled with an entity word category;
according to the entity words labeled with entity word categories in the text fragment of each search text, select the text fragments containing an entity word whose category is the user name category, and from these select the text fragments that share the same user name; the text fragments sharing the same user name form a training text set.
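The construction of training text sets from text fragments can be sketched as a grouping step. The fragment layout (a dict with an `"entities"` list of `(word, category)` pairs), the category string `"user_name"`, and the function name `build_training_sets` are assumptions for illustration; the source does not define a data format.

```python
from collections import defaultdict

USER_NAME = "user_name"  # hypothetical category identifier


def build_training_sets(fragments):
    """Group text fragments by the user name they contain.

    fragments: list of dicts, each with an "entities" list of
               (entity_word, category) pairs (assumed layout).
    Returns a dict mapping each user name to the fragments that mention it;
    each value plays the role of one training text set.
    """
    grouped = defaultdict(list)
    for fragment in fragments:
        # Keep only fragments that contain a user-name entity word.
        names = {word for word, category in fragment["entities"]
                 if category == USER_NAME}
        for name in names:
            grouped[name].append(fragment)
    return dict(grouped)
```

Fragments without any user-name entity word are dropped, and a fragment that mentions two user names would appear in both sets under this sketch.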
In an embodiment, the processing unit 603 is specifically configured to:
for any search text, perform word segmentation and entity word recognition on the search text to obtain a word segmentation set and an entity word recognition result for the search text, where the entity word recognition result includes one or more entity words and the category of each entity word;
filter the word segmentation set, where the filtered word segmentation set and the entity word recognition result together form the text fragment of that search text.
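The fragment-building steps above (segment, recognize entity words, filter, combine) can be sketched as follows. Everything concrete here is a stand-in: whitespace splitting replaces a real word segmenter, a lookup table `entity_lexicon` replaces a real entity recognizer, and stop-word removal is an assumed filtering rule, since the source names none of these.

```python
def build_text_fragment(search_text, entity_lexicon, stop_words):
    """Build the text fragment of one search text.

    search_text:    raw query string.
    entity_lexicon: dict mapping known entity words to their categories
                    (stand-in for an entity word recognizer).
    stop_words:     set of tokens removed by the filtering step
                    (assumed filtering criterion).
    """
    # Word segmentation (whitespace split used as a placeholder).
    tokens = search_text.split()

    # Entity word recognition: each entity word with its category.
    entities = [(token, entity_lexicon[token])
                for token in tokens if token in entity_lexicon]

    # Filter the word segmentation set.
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # The filtered tokens and the recognition result form the text fragment.
    return {"tokens": filtered_tokens, "entities": entities}
```

The returned dict uses the same assumed layout as a fragment with `"tokens"` and `"entities"` fields, so its output could feed a grouping step keyed on user names.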
In this embodiment of the present invention, after the target text is obtained, the obtaining unit 601 may obtain the user name of the target user from the target text and use it as the target entity word. After the target entity word is determined, the determining unit 602 may further determine the associated entity words of the target entity word from the target text. Because both the target entity word and the associated entity words are obtained from the target text, entity words with non-coherent semantics can be obtained. The processing unit 603 may then perform recognition processing on the target entity word to obtain a first word vector, and on the associated entity words to obtain second word vectors, and may combine the first word vector and the second word vectors to predict a prediction label of the target entity word, so as to determine from the prediction label whether the target user indicated by the target entity word has the target identity. In this way, the label of the target entity word (that is, its refined entity type) is determined from entity words with non-coherent semantics, which makes entity word label prediction more convenient and therefore more efficient.
Fig. 7 is a schematic structural block diagram of a computer device according to an embodiment of the present invention. The computer device may be a server, which may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms. The computer device may also be a terminal device, which may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The computer device in the present embodiment shown in Fig. 7 may include one or more processors 701, one or more input devices 702, one or more output devices 703, and a memory 704. The processor 701, the input device 702, the output device 703, and the memory 704 are connected by a bus 705. The memory 704 is used to store a computer program comprising program instructions, and the processor 701 is used to execute the program instructions stored in the memory 704.
The memory 704 may include a volatile memory, such as a random-access memory (RAM); the memory 704 may also include a non-volatile memory, such as a flash memory or a solid-state drive (SSD); the memory 704 may also include a combination of the above types of memory.
The processor 701 may be a central processing unit (CPU). The processor 701 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), generic array logic (GAL), or the like. The processor 701 may also be a combination of the above structures.
In an embodiment of the present invention, the memory 704 is configured to store a computer program that includes program instructions, and the processor 701 is configured to execute the program instructions stored in the memory 704, so as to implement the steps of the corresponding methods described above with reference to Fig. 2 and Fig. 3.
In one embodiment, the processor 701 is configured to call the program instructions to perform:
acquiring a target text, and acquiring a target entity word from the target text, wherein the target entity word is a user name of a target user;
determining associated entity words of the target entity words from the target text, wherein the associated entity words are any one or more other entity words except the target entity words in the target text, and the associated entity words comprise descriptors related to the identity characteristics of the target user;
identifying the target entity word to obtain a first word vector of the target entity word, and identifying the associated entity word to obtain a second word vector of the associated entity word;
and performing label prediction processing on the target entity word by combining the first word vector and the second word vector to obtain a prediction label of the target entity word, wherein the prediction label is used for indicating whether the user identity of the target user is the target identity.
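The final step, combining the first word vector with the second word vectors to predict a label, can be sketched as below. The source does not state how the vectors are combined or what the prediction layer looks like, so mean pooling of the second word vectors, concatenation with the first word vector, and a single logistic layer are all assumptions, as are the names `predict_label`, `weights`, `bias`, and `threshold`.

```python
import math


def predict_label(first_vector, second_vectors, weights, bias=0.0, threshold=0.5):
    """Predict whether the target user has the target identity.

    first_vector:   word vector of the target entity word (the user name).
    second_vectors: word vectors of the associated entity words.
    weights, bias:  parameters of an assumed logistic prediction layer;
                    len(weights) must equal 2 * len(first_vector).
    Returns (label, probability), where label 1 means "target identity".
    """
    dim = len(first_vector)

    # Mean-pool the second word vectors into one vector (assumed combination).
    pooled = [sum(vec[i] for vec in second_vectors) / len(second_vectors)
              for i in range(dim)]

    # Combine the first word vector with the pooled second word vector.
    combined = list(first_vector) + pooled

    # Logistic layer mapping the combined vector to a probability.
    z = sum(w * x for w, x in zip(weights, combined)) + bias
    probability = 1.0 / (1.0 + math.exp(-z))

    return (1 if probability >= threshold else 0), probability
```

In a trained system the weights would come from the training procedure rather than being supplied by hand; they are passed in explicitly here only to keep the sketch self-contained.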
Embodiments of the present invention provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method embodiments shown in Fig. 2 or Fig. 3. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like.
While the invention has been described with reference to a particular embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (14)

1. A method of data processing, comprising:
acquiring a target text, and acquiring a target entity word from the target text, wherein the target entity word is a user name of a target user;
determining associated entity words of the target entity words from the target text, wherein the associated entity words are any one or more other entity words except the target entity words in the target text, and the associated entity words comprise descriptors related to the identity characteristics of the target user;
identifying the target entity word to obtain a first word vector of the target entity word, and identifying the associated entity word to obtain a second word vector of the associated entity word;
and performing label prediction processing on the target entity word by combining the first word vector and the second word vector to obtain a prediction label of the target entity word, wherein the prediction label is used for indicating whether the user identity of the target user is the target identity.
2. The method of claim 1, wherein the target text is any one text in a training text set, and each training text in the set includes the target entity word; the identifying the associated entity word to obtain a second word vector of the associated entity word includes:
identifying the associated entity words corresponding to the target entity word in each training text, to obtain a second word vector for each associated entity word.
3. The method of claim 1, wherein the first word vector, the second word vector, and the prediction label are all obtained by invoking a target network model, the target network model being a neural network for user identity discrimination; the method further comprises:
acquiring an importance score corresponding to each associated entity word, and determining a real label of the target entity word according to the importance score; the importance degree score is used for indicating the accuracy degree of describing the identity characteristics of the target user by adopting corresponding associated entity words;
determining label difference between the predicted label and the real label according to the predicted label and the real label of the target entity word, and adjusting the model parameter of the target network model according to the label difference;
and when the label difference is smaller than a preset difference, stopping training the target network model to obtain the trained target network model.
4. The method of claim 3, wherein the number of associated entity words is N, wherein N ≧ 1 and is an integer; the determining the real label of the target entity word according to the importance score comprises:
sorting the N associated entity words in descending order of their importance scores, and selecting L associated entity words in order starting from the first-ranked one, where L is an integer and 1 ≤ L ≤ N;
if the selected L associated entity words comprise reference entity words, determining that the real label of the target entity word is a first label, wherein the first label is used for indicating that the user identity of the target user is the target identity, and the reference entity words are descriptors which are preset based on the target identity and are related to the identity characteristics of the target identity.
5. The method of claim 4, wherein the method further comprises:
if the selected L associated entity words do not include the reference entity word, determining that the real label of the target entity word is a second label, wherein the second label is used for indicating that the user identity of the target user is not the target identity; or,
if the selected L associated entity words do not include the reference entity word, selecting a further J associated entity words in order starting from the (L+1)-th associated entity word, and when the selected L+J associated entity words do not include the reference entity word, determining that the real label of the target entity word is the second label, where J is an integer and 1 ≤ J ≤ N.
6. The method of claim 4 or 5, wherein the target identity comprises: a game commentary identity;
the reference entity words preset for the game commentary identity include any one or more of the following: commentary, and entity words of any game name category.
7. The method of claim 3, wherein the number of associated entity words is N, wherein N ≧ 1 and is an integer; the obtaining of the importance score corresponding to each associated entity word includes:
acquiring the occurrence times of any associated entity word in the N associated entity words, and determining the importance degree score corresponding to any associated entity word according to the occurrence times;
the importance score of an associated entity word is either its number of occurrences among the N associated entity words, or the value obtained by normalizing that number of occurrences.
8. The method of claim 2, wherein the training text set is obtained in a manner comprising:
acquiring a search log, and determining a text fragment of each search text in the search log, wherein the text fragment comprises one or more entity words, and the entity words marked with entity word categories exist in the one or more entity words;
according to the entity words labeled with entity word categories in the text fragment of each search text, selecting the text fragments containing an entity word whose category is the user name category, and from these selecting the text fragments that share the same user name, wherein the text fragments sharing the same user name form a training text set.
9. The method of claim 8, wherein the determining a text snippet for each piece of search text in the search log comprises:
for any search text, performing word segmentation and entity word recognition on the search text to obtain a word segmentation set and an entity word recognition result for the search text, wherein the entity word recognition result includes one or more entity words and the category of each entity word;
filtering the word segmentation set, wherein the filtered word segmentation set and the entity word recognition result together form the text fragment of that search text.
10. The method of claim 1, further comprising:
if the prediction tag indicates that the user identity of the target user is the target identity, adding the target entity word into a search word list, wherein the user identity of the user corresponding to each entity word in the search word list is the target identity;
and adding the search word list to an intention identification module so that the intention identification module can perform intention search feedback based on the search word list.
11. The method of claim 10, wherein the method further comprises:
acquiring an intention search word, and acquiring a search user name from the intention search word, wherein the intention search word is an entity word which does not point to specific search content;
and searching whether the search user name is included in the search word list, and when the search user name is determined to be included, taking data of a search user corresponding to the search user name under the target identity as search feedback data and outputting the search feedback data.
12. A data processing apparatus, comprising:
an acquisition unit, configured to acquire a target text and acquire a target entity word from the target text, where the target entity word is a user name of a target user;
a determining unit, configured to determine, from the target text, an associated entity word of the target entity word, where the associated entity word is any one or more other entity words except the target entity word in the target text, and the associated entity word includes a descriptor related to an identity feature of the target user;
the processing unit is used for identifying the target entity words to obtain first word vectors of the target entity words and identifying the associated entity words to obtain second word vectors of the associated entity words;
the processing unit is further configured to perform label prediction processing on the target entity word by combining the first word vector and the second word vector to obtain a prediction label of the target entity word, where the prediction label is used to indicate whether the user identity of the target user is a target identity.
13. A computer device comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1 to 11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 11.
CN202110793816.4A 2021-07-13 2021-07-13 Data processing method and device, computer equipment and storage medium Pending CN115618873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110793816.4A CN115618873A (en) 2021-07-13 2021-07-13 Data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110793816.4A CN115618873A (en) 2021-07-13 2021-07-13 Data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115618873A true CN115618873A (en) 2023-01-17

Family

ID=84856176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110793816.4A Pending CN115618873A (en) 2021-07-13 2021-07-13 Data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115618873A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591660A (en) * 2024-01-18 2024-02-23 杭州威灿科技有限公司 Material generation method, equipment and medium based on digital person
CN117591660B (en) * 2024-01-18 2024-04-16 杭州威灿科技有限公司 Material generation method, equipment and medium based on digital person

Similar Documents

Publication Publication Date Title
CN108829893B (en) Method and device for determining video label, storage medium and terminal equipment
US11514235B2 (en) Information extraction from open-ended schema-less tables
US9449271B2 (en) Classifying resources using a deep network
US9626622B2 (en) Training a question/answer system using answer keys based on forum content
CN110929125B (en) Search recall method, device, equipment and storage medium thereof
WO2015083309A1 (en) Mining forums for solutions to questions
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
Steinmetz et al. Semantic multimedia information retrieval based on contextual descriptions
US9727617B1 (en) Systems and methods for searching quotes of entities using a database
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
US11861918B2 (en) Image analysis for problem resolution
CN109635157A (en) Model generating method, video searching method, device, terminal and storage medium
US20230032728A1 (en) Method and apparatus for recognizing multimedia content
CN112015928A (en) Information extraction method and device of multimedia resource, electronic equipment and storage medium
CN110309355B (en) Content tag generation method, device, equipment and storage medium
CN115618873A (en) Data processing method and device, computer equipment and storage medium
CN110245357B (en) Main entity identification method and device
US11386163B2 (en) Data search method and data search system thereof for generating and comparing strings
US20230090601A1 (en) System and method for polarity analysis
CN110888896B (en) Data searching method and data searching system thereof
CN114792092B (en) Text theme extraction method and device based on semantic enhancement
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN113505293A (en) Information pushing method and device, electronic equipment and storage medium
CN111310016B (en) Label mining method, device, server and storage medium
CN113392312A (en) Information processing method and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40080355; Country of ref document: HK)