CN116069948B - Content wind control knowledge base construction method, device, equipment and storage medium - Google Patents

Info

Publication number
CN116069948B
Authority
CN
China
Prior art keywords
entity
wind control
content
ontology
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310094574.9A
Other languages
Chinese (zh)
Other versions
CN116069948A (en)
Inventor
张凤珍
靳国庆
李罗政
张冬明
张勇东
辛瑞佳
曲畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
People's Network Information Technology Co ltd
Konami Sports Club Co Ltd
Original Assignee
People's Network Information Technology Co ltd
People Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by People's Network Information Technology Co ltd, People Co Ltd filed Critical People's Network Information Technology Co ltd
Priority to CN202310094574.9A priority Critical patent/CN116069948B/en
Publication of CN116069948A publication Critical patent/CN116069948A/en
Application granted granted Critical
Publication of CN116069948B publication Critical patent/CN116069948B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, a device, equipment and a storage medium for constructing a content wind control knowledge base. The method comprises the following steps: modeling the content wind control domain ontology according to preset corpus data, wherein the modeling comprises ontology concept modeling and ontology relationship modeling; performing content wind control knowledge extraction according to the modeled ontology concepts and ontology relationships, wherein the content wind control knowledge extraction comprises entity relationship extraction and entity extraction; and constructing a content wind control knowledge base according to the extracted entity relationships and entities. In this way, content wind control knowledge is organized through ontology design and a domain knowledge base oriented to content wind control is constructed, which provides knowledge support for knowledge-graph-based content wind control technical services, provides a reliable knowledge base for machine language understanding and knowledge reasoning, and improves the accuracy and reliability of intelligent auditing.

Description

Content wind control knowledge base construction method, device, equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for constructing a content wind control knowledge base.
Background
In the face of the growing volume of internet content data and the regulatory requirements of content security, content wind control services that rely on technology as their main tool continue to expand. However, most traditional content wind control knowledge bases are literature bases that cannot provide structured, systematic wind control knowledge, so they struggle to meet the application requirements of the content wind control field. With the rapid progress of artificial intelligence and knowledge graph technology, a content wind control knowledge base with knowledge reasoning and knowledge updating capabilities is increasingly needed and has broad room for application.
Disclosure of Invention
In view of the foregoing, the present application provides a method, apparatus, device, and storage medium for constructing a content wind control knowledge base that overcome or at least partially solve the foregoing problems.
According to one aspect of the present application, there is provided a content wind control knowledge base construction method, including:
modeling the content wind control domain ontology according to preset corpus data; wherein the modeling comprises ontology concept modeling and ontology relationship modeling;
content wind control knowledge extraction is carried out according to the modeled ontology concepts and ontology relations; the content wind control knowledge extraction comprises entity relation extraction and entity extraction;
and constructing a content wind control knowledge base according to the extracted entity relationship and the entity.
According to another aspect of the present application, there is provided a content wind-controlled knowledge base construction apparatus, including:
the modeling module is used for modeling the content wind control domain ontology according to preset corpus data; wherein the modeling comprises ontology concept modeling and ontology relationship modeling;
the knowledge extraction module is used for carrying out content wind control knowledge extraction according to the modeled ontology concepts and ontology relations; the content wind control knowledge extraction comprises entity relation extraction and entity extraction;
and the knowledge base construction module is used for constructing a content wind control knowledge base according to the extracted entity relation and the entity.
According to another aspect of the present application, there is provided an electronic device including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the content wind control knowledge base construction method.
According to another aspect of the present application, there is provided a computer storage medium, where at least one executable instruction is stored, where the executable instruction causes a processor to execute operations corresponding to the content wind control knowledge base construction method described herein.
According to the method, the device and the storage medium for constructing the content wind control knowledge base, the content wind control domain ontology is modeled according to preset corpus data; wherein the modeling comprises ontology concept modeling and ontology relationship modeling; content wind control knowledge extraction is carried out according to the modeled ontology concepts and ontology relations; the content wind control knowledge extraction comprises entity relation extraction and entity extraction; and constructing a content wind control knowledge base according to the extracted entity relationship and the entity. The content wind control knowledge is formed through ontology design, a domain knowledge base oriented to content wind control is constructed, knowledge support is provided for content wind control technical service based on a knowledge graph, a reliable content wind control knowledge base is provided for language understanding and knowledge reasoning of a computer, and accuracy and reliability of intelligent auditing are improved.
The foregoing description is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present application are more apparent, specific embodiments of the present application are set out below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
Fig. 1 is a schematic flow chart of a method for constructing a content wind control knowledge base according to an embodiment of the present application;
Fig. 2 is a schematic diagram of ontology modeling in a content wind control knowledge base construction method according to a first embodiment of the present application;
Fig. 3 is a schematic diagram of entity relationship extraction and entity extraction in a method for constructing a content wind control knowledge base according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a modeling process of a NARRE double-tower model in a content wind control knowledge base construction method according to a second embodiment of the application;
Fig. 5 is a schematic structural diagram of a content wind control knowledge base construction device according to a third embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example 1
Fig. 1 is a schematic flow chart of a method for constructing a content wind control knowledge base according to an embodiment of the present application. As shown in Fig. 1, the method includes:
Step S11, modeling the content wind control domain ontology according to preset corpus data; wherein the modeling includes ontology concept modeling and ontology relationship modeling.
The preset corpus data may be obtained in advance and may be, for example, sentences or words. Specifically, the official web pages of mainstream media can be taken as starting points; following network links, the crawl proceeds from point to surface and deepens layer by layer, the HyperText Markup Language (HTML) markup is parsed in depth, and the HTML-tagged content is collected and parsed at regular intervals to obtain the original corpus data. The original corpus data is then preprocessed, which includes de-duplicating the multi-source original corpus data and removing tags and special characters from the text. In the feature extraction process, this embodiment uses the topic relevance of words to calculate keyword weights and extract text features, applies a similarity algorithm to obtain the degree of semantic similarity between data items, and incorporates a fast clustering algorithm to obtain the final semantic similarity; the original corpus data is de-duplicated on that basis to obtain the preset corpus data.
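The cleaning and de-duplication step can be illustrated with a minimal sketch. It is not the embodiment's algorithm: TF-IDF stands in for the topic-relevance keyword weights, a greedy similarity threshold stands in for the fast clustering step, and the names `clean_html`, `tfidf_vectors` and `deduplicate` as well as the 0.9 threshold are assumptions made for illustration. Whitespace tokenization is also an assumption; Chinese text would need a word segmenter first.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(raw):
    """Strip tags and special characters, then normalize whitespace and case."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = " ".join(parser.parts)
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip().lower()


def tfidf_vectors(docs):
    """Weight each word by term frequency times a smoothed inverse document frequency."""
    tokenized = [doc.split() for doc in docs]
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {w: c * (math.log((1 + n) / (1 + df[w])) + 1.0) for w, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(raw_pages, threshold=0.9):
    """Keep a page only if it is not too similar to any page already kept."""
    docs = [clean_html(page) for page in raw_pages]
    vectors = tfidf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]
```

The greedy pass compares each page against every page already kept, which is quadratic in the worst case; a production pipeline would more likely bucket pages first, which is the role the fast clustering step appears to play.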
An ontology is an important knowledge base that represents the basic terms of a subject domain's vocabulary and the relationships among them. The content wind control domain ontology is a system comprising the content wind control terms, the normative relationships among those terms, and their descriptions. In this embodiment, ontology terms are extracted with a multi-strategy fusion method. The preset corpus data is segmented into words and the segmented words are tagged with parts of speech. A domain term filtering algorithm is then designed from elements such as stop words, numerals, measure words, date and place nouns, low-frequency personal names identified by named entity recognition, and manually screened keywords. The initial terms are filtered over multiple rounds to remove words that have no clear meaning, have a disordered grammatical structure, or are semantically near-identical to other terms, which finally yields the ontology terms of the content wind control field.
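The multi-round filtering described above can be sketched as a chain of predicates applied one after another. The word lists, the regular expressions and the `min_person_freq` parameter below are hypothetical placeholders, and the sketch assumes that word segmentation and named entity recognition have already produced the candidate terms and the name lists.

```python
import re
from collections import Counter

# Hypothetical filter resources; a real system would load curated lists.
STOP_WORDS = {"the", "of", "and", "a", "in"}
MEASURE_WORDS = {"kilogram", "meter", "piece"}
NUMERAL_RE = re.compile(r"^\d+([.,]\d+)?$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def filter_terms(candidates, person_names=(), place_names=(), min_person_freq=2):
    """Run candidate terms through successive filters; survivors keep first-seen order."""
    counts = Counter(candidates)
    persons = set(person_names)
    places = set(place_names)

    rounds = [
        lambda t: t not in STOP_WORDS,
        lambda t: not NUMERAL_RE.match(t) and t not in MEASURE_WORDS,
        lambda t: not DATE_RE.match(t) and t not in places,
        # Drop personal names that occur too rarely in the corpus.
        lambda t: not (t in persons and counts[t] < min_person_freq),
    ]

    terms = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    for keep in rounds:
        terms = [t for t in terms if keep(t)]
    return terms
```

Keeping each round as a separate predicate makes it easy to add the manual keyword screen or a semantic near-duplicate check as further rounds.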
Step S12, content wind control knowledge extraction is carried out according to the modeled ontology concepts and ontology relations; the content wind control knowledge extraction comprises entity relation extraction and entity extraction.
Specifically, a pre-trained language model fine-tuned on a classification task is first used to extract the entity relationships, and the entity relationship information is then fused with the pre-trained language model to extract the entities.
Step S13, constructing a content wind control knowledge base according to the extracted entity relationships and entities.
Specifically, a Resource Description Framework (RDF) storage system may be employed to store the content wind control knowledge as graph data, with a relational database as the underlying storage.
Therefore, this embodiment models the content wind control domain ontology according to the preset corpus data, where the modeling comprises ontology concept modeling and ontology relationship modeling; performs content wind control knowledge extraction according to the modeled ontology concepts and ontology relationships, where the content wind control knowledge extraction comprises entity relationship extraction and entity extraction; and constructs a content wind control knowledge base according to the extracted entity relationships and entities. Content wind control knowledge is organized through ontology design and a knowledge base oriented to content wind control is constructed, which provides knowledge support for knowledge-graph-based content wind control technical services, provides a reliable knowledge base for machine language understanding and knowledge reasoning, and improves the accuracy and reliability of intelligent auditing.
In an alternative embodiment, the ontology concept modeling includes:
acquiring ontology terms of the content wind control field according to the preset corpus data; calculating word embedding features of the ontology terms, and performing multistage clustering on the word embedding features; and modeling the ontology concepts of the content wind control field as persons, institutions, events and domain feature vocabularies according to the characteristics of content wind control and the 5W elements of media content.
The 5W elements are when, where, what, why and who. Specifically, this embodiment may use word embedding features: the word embedding feature of each term is calculated with the Directional Skip-Gram (DSG) algorithm, and the word embedding features of the terms are clustered in multiple stages with the k-means algorithm. As shown in Fig. 2, combining the characteristics of content wind control with the 5W elements of media content, the content wind control domain ontology concepts are modeled as: person, institution, event, and domain feature vocabulary.
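A minimal sketch of multistage clustering follows, assuming the term embeddings have already been computed (the DSG training itself is not shown). The farthest-point initialization, the fixed iteration count and the `k_per_stage` parameter are choices made for this illustration, not details taken from the embodiment.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    """Column-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iters=20):
    """Plain k-means with farthest-point initialization; returns one label per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(_dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: _dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = _mean(members)
    return labels


def multistage_cluster(embeddings, k_per_stage=(2, 2)):
    """Cluster the terms, then re-cluster inside each cluster, once per stage."""
    groups = [list(embeddings)]  # each group is a list of terms
    for k in k_per_stage:
        next_groups = []
        for group in groups:
            if len(group) <= k:
                next_groups.append(group)  # too small to split further
                continue
            labels = kmeans([embeddings[t] for t in group], k)
            for j in range(k):
                sub = [t for t, lab in zip(group, labels) if lab == j]
                if sub:
                    next_groups.append(sub)
        groups = next_groups
    return groups
```

Each stage refines the previous one, so the first stage can separate broad categories such as persons and institutions while later stages split them into finer concept groups.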
In an alternative embodiment, the ontology relationship modeling includes:
when the relation among ontology terms in the content wind control field is a hierarchical relation, extracting by adopting a template preset by an expert and a multi-strategy mode based on language rules and a clustering method; when the relation between the ontology terms in the content wind control field is a non-hierarchical relation, the corpus data is analyzed by adopting a natural language processing technology, the core verbs in each sentence are identified, the terms adjacent to the core verbs are found by combining the context, and the relation between the two terms is constructed.
The relationships among ontology terms are divided into hierarchical and non-hierarchical relationships. Hierarchical relationships can be extracted with a multi-strategy approach that combines expert-preset templates, language rules and a clustering method; for example, the relationship between a person and an organization is divided into categories such as establishment and visit, and the relationship between persons is divided into relatives, colleagues/superior and subordinate, and so on. Non-hierarchical relationships are extracted with deep natural language processing: syntactic analysis and semantic dependency analysis are performed on the text, the core verb of each sentence is identified, and the terms adjacent to the core verb are then found from the context to construct the relationship between the two terms.
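The non-hierarchical case can be illustrated with a small sketch that takes the nearest ontology term on each side of the core verb. It assumes an upstream parser has already tokenized the sentence and located the core verb; the function name `extract_relation` and its arguments are hypothetical.

```python
def extract_relation(tokens, core_index, terms):
    """Return (left_term, verb, right_term) around the core verb, or None.

    tokens: list of words in sentence order.
    core_index: position of the core verb (assumed to come from a dependency parser).
    terms: set of known ontology terms.
    """
    # Nearest ontology term before the verb, scanning right to left.
    left = next((w for w in reversed(tokens[:core_index]) if w in terms), None)
    # Nearest ontology term after the verb, scanning left to right.
    right = next((w for w in tokens[core_index + 1:] if w in terms), None)
    if left is None or right is None:
        return None
    return (left, tokens[core_index], right)
```

For example, with the tokens `["the", "minister", "visited", "the", "factory"]`, a core verb at index 2 and the terms `{"minister", "factory"}`, the sketch yields the triple `("minister", "visited", "factory")`.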
In an alternative embodiment, the entity relationship extraction includes:
giving a sentence, and feeding the sentence into an encoder to obtain the corresponding word vectors; realizing implicit encoding of entity information by modeling the importance of the word vectors and the correlation between them, and adding an average pooling operation to obtain the entity embedding feature of the sentence; and concatenating the entity embedding vector with the word vectors and classifying the result through a neural network to obtain the entity relationship representation of the whole sentence.
Specifically, the entity relationship extraction task within a sentence is built on the hidden-layer embeddings of the language model and is implemented as sentence-level text classification. As shown in Fig. 3, given a sentence S_n, S_n is first fed into a RoBERTa encoder to obtain the corresponding word vectors, where n is the number of words in the sentence and d is the vector dimension. Entity relationships depend on prior knowledge such as entity category and positional order, which manifests within a sentence as the degree of association between words. An attention-based entity information encoding layer is therefore placed after the hidden layer: it implicitly encodes entity information by modeling the importance of the word vectors and the correlation between them, and an average pooling operation is added to obtain the entity embedding feature of the whole sentence. Here AttNet denotes the use of a self-attention mechanism to obtain the embedded information between hidden-layer vectors. The generated entity embedding vector is concatenated with the word vectors and classified by a neural network to obtain the relationship representation of the whole sentence, where σ is the sigmoid activation function. A threshold ε is set; when the output for relation r_i exceeds ε, the sentence is determined to contain relation r_i.
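One possible formalization consistent with this description is given below. The symbols H, E, W, b, \bar{h} and \hat{r} are notation assumed here for illustration and are not the patent's own; in particular, using the mean word vector \bar{h} as the sentence-side input to the concatenation is an assumption.

```latex
\begin{aligned}
H &= \mathrm{RoBERTa}(S_n) \in \mathbb{R}^{n \times d} \\
E &= \mathrm{AvgPool}\big(\mathrm{AttNet}(H)\big) \in \mathbb{R}^{d} \\
\hat{r} &= \sigma\big(W\,[E \,;\, \bar{h}] + b\big), \qquad \bar{h} = \tfrac{1}{n}\textstyle\sum_{j=1}^{n} H_j \\
\text{relation } r_i \text{ is predicted} &\iff \hat{r}_i > \varepsilon
\end{aligned}
```

Because σ is applied per relation rather than as a softmax over all relations, a single sentence can be assigned several relations at once, which matches the thresholding rule in the text.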
In an alternative embodiment, the entity extraction includes:
obtaining a continuous representation of the relationship prompt information according to the entity relationship representation; and fusing the continuous representation with the word vector, identifying the entity by using the conditional random field, and obtaining the output of each word in the entity classification stage.
Specifically, as shown in Fig. 3, the relationships obtained in the entity relationship extraction stage are converted into a one-hot vector, which is re-parameterized with a multi-layer perceptron to obtain a continuous representation P_r of the relationship prompt information. The relationship prompt information P_r and the word vectors are fused through a Transformer network module: P_r is concatenated with the K and V vectors in the network to compute the modified attention inputs K_p and V_p, and the prompt parameters P_r and the attention weight matrix are updated synchronously during training. A conditional random field is then used to identify the entities and obtain the output Y_n of each word in the entity classification stage, i.e., the probability that a token in the current input is an entity of a certain class.
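The prompt fusion can be written out as follows, as a sketch under stated assumptions: the scaled dot-product form of attention and the row-wise concatenation of P_r in front of K and V follow common prefix-style prompting and are not confirmed by the text, and Q and d (the query matrix and key dimension) are notation introduced here.

```latex
\begin{aligned}
P_r &= \mathrm{MLP}\big(\mathrm{onehot}(r)\big) \\
K_p &= [P_r \,;\, K], \qquad V_p = [P_r \,;\, V] \\
\mathrm{Attn}(Q, K_p, V_p) &= \mathrm{softmax}\!\left(\frac{Q K_p^{\top}}{\sqrt{d}}\right) V_p \\
Y_n &= \mathrm{CRF}\big(\mathrm{Attn}(Q, K_p, V_p)\big)
\end{aligned}
```

Under this reading, every token can attend to the relation prompt as if it were extra context, so the entity tagger is conditioned on the relations found in the first stage.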
Example two
This embodiment of the present application provides a specific implementation of the method for constructing a content wind control knowledge base, used to describe the scheme of the present invention in detail. As shown in Fig. 4, the method specifically includes the following steps:
Step S21, data acquisition and processing.
In this embodiment, the official web pages of mainstream media are taken as starting points; following network links, the crawl proceeds from point to surface and deepens layer by layer, the HTML markup is parsed in depth, and the HTML-tagged content is collected and parsed at regular intervals. After data acquisition, the original data is preprocessed, which includes de-duplicating the multi-source data and removing tags and special characters from the text. In the feature extraction process, this embodiment uses the topic relevance of words to calculate keyword weights and extract text features, applies a similarity algorithm to obtain the degree of semantic similarity between data items, and incorporates a fast clustering algorithm to obtain the final semantic similarity; the data is de-duplicated on that basis to obtain the final corpus data.
Step S22, modeling the content wind control domain ontology.
An ontology is an important knowledge base that represents basic terms and relationships of vocabulary for a topic area. The content wind control field ontology is a system which comprises content wind control terms and canonical relations among the terms and descriptions. The content wind control field ontology modeling comprises the following parts:
Ontology term definition: ontology terms are extracted with a multi-strategy fusion method. The corpus data is segmented into words and the segmented words are tagged with parts of speech. A domain term filtering algorithm is then designed from elements such as stop words, numerals, measure words, date and place nouns, low-frequency personal names identified by named entity recognition, and manually screened keywords. The initial terms are filtered over multiple rounds to remove words that have no clear meaning, have a disordered grammatical structure, or are semantically near-identical to other terms, which finally yields the ontology terms of the content wind control field.
Ontology concept modeling: this embodiment uses word embedding features. The word embedding feature of each term is calculated with the Directional Skip-Gram (DSG) algorithm, and the word embedding features of the terms are clustered in multiple stages with the k-means algorithm. As shown in Fig. 2, combining the characteristics of content wind control with the 5W elements of media content, the content wind control domain ontology concepts are modeled as: person, institution, event, and domain feature vocabulary.
Ontology relationship modeling: the relationships among ontology terms are divided into hierarchical and non-hierarchical relationships. Hierarchical relationships are extracted with a multi-strategy approach that combines expert-preset templates, language rules and a clustering method; for example, the relationship between a person and an organization is divided into categories such as establishment and visit, and the relationship between persons is divided into relatives, colleagues/superior and subordinate, and so on. Non-hierarchical relationships are extracted with deep natural language processing: syntactic analysis and semantic dependency analysis are performed on the text, the core verb of each sentence is identified, and the terms adjacent to the core verb are then found from the context to construct the relationship between the two terms.
Step S23, extracting the content wind control knowledge.
This embodiment provides a two-stage extraction method: a pre-trained language model fine-tuned on a classification task first extracts the entity relationships, and the entity relationship information is then fused with the pre-trained language model to extract the entities. The implementation steps are as follows:
Entity relationship extraction: the entity relationship extraction task within a sentence is built on the hidden-layer embeddings of the language model and is implemented as sentence-level text classification. As shown in Fig. 3, given a sentence S_n, S_n is first fed into a RoBERTa encoder to obtain the corresponding word vectors, where n is the number of words in the sentence and d is the vector dimension. Entity relationships depend on prior knowledge such as entity category and positional order, which manifests within a sentence as the degree of association between words. An attention-based entity information encoding layer is therefore placed after the hidden layer: it implicitly encodes entity information by modeling the importance of the word vectors and the correlation between them, and an average pooling operation is added to obtain the entity embedding feature of the whole sentence. Here AttNet denotes the use of a self-attention mechanism to obtain the embedded information between hidden-layer vectors. The generated entity embedding vector is concatenated with the word vectors and classified by a neural network to obtain the relationship representation of the whole sentence, where σ is the sigmoid activation function. A threshold ε is set; when the output for relation r_i exceeds ε, the sentence is determined to contain relation r_i.
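The pooling, concatenation and thresholding steps can be sketched in plain Python. This is an illustration only: the encoder and the attention layer are not implemented, so `word_vectors` and `entity_vectors` are assumed to be given, the sentence summary is assumed to be the mean word vector, and `weights`, `bias` and the default `eps=0.5` are hypothetical parameters.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors column by column."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_relations(word_vectors, entity_vectors, weights, bias, labels, eps=0.5):
    """Multi-label relation prediction for one sentence.

    word_vectors: n x d encoder outputs.
    entity_vectors: n x d outputs of the attention-based entity layer.
    weights: one row of length 2d per relation; bias: one value per relation.
    """
    entity_emb = mean_pool(entity_vectors)    # average pooling over the sentence
    sentence_emb = mean_pool(word_vectors)    # assumed sentence-level summary
    features = entity_emb + sentence_emb      # list concatenation (splicing)
    scores = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(weights, bias)
    ]
    # A relation is reported when its score exceeds the threshold.
    return [lab for lab, s in zip(labels, scores) if s > eps]
```

Each relation gets its own sigmoid score, so the function can return zero, one or several relations for the same sentence.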
Entity extraction: the relationships obtained in the entity relationship extraction stage are converted into a one-hot vector, which is re-parameterized with a multi-layer perceptron to obtain a continuous representation P_r of the relationship prompt information. The relationship prompt information P_r and the word vectors are fused through a Transformer network module: P_r is concatenated with the K and V vectors in the network to compute the modified attention inputs K_p and V_p, and the prompt parameters P_r and the attention weight matrix are updated synchronously during training. A conditional random field is then used to identify the entities and obtain the output Y_n of each word in the entity classification stage, i.e., the probability that a token in the current input is an entity of a certain class.
Step S24, constructing the content wind control knowledge base.
In this embodiment, an RDF storage system is adopted: the content wind control knowledge is stored as graph data, with a relational database as the underlying storage.
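As a minimal sketch of storing graph data on a relational backend, the triples can be kept in a single three-column table. SQLite is used here only because it ships with Python; the table layout, the function names and the `ex:` identifiers in the usage note are assumptions for illustration, not the storage system of the embodiment.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create a subject/predicate/object table in a relational database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        "subject TEXT NOT NULL, predicate TEXT NOT NULL, object TEXT NOT NULL, "
        "UNIQUE(subject, predicate, object))"
    )
    return conn


def add_triple(conn, subject, predicate, obj):
    """Insert one triple; an exact duplicate is silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", (subject, predicate, obj)
    )


def match(conn, subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("subject", subject), ("predicate", predicate), ("object", obj)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = "SELECT subject, predicate, object FROM triples" + where + " ORDER BY rowid"
    return conn.execute(sql, params).fetchall()
```

After `add_triple(conn, "ex:PersonA", "ex:founded", "ex:OrgB")`, a call such as `match(conn, predicate="ex:founded")` returns the stored triple, which corresponds to a single-pattern graph query.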
This embodiment automatically performs data cleaning, structured extraction and knowledge mining, and finally constructs a content wind control knowledge base that comprises an underlying term base for the content wind control field and the relational knowledge among the terms, and that can provide support services for media content wind control.
Example III
Fig. 5 shows a schematic structural diagram of a content wind-controlled knowledge base construction device according to a third embodiment of the present application. As shown in fig. 5, the apparatus includes: a modeling module 31, a knowledge extraction module 32, and a knowledge base construction module 33; wherein,
the modeling module 31 is configured to model the content wind control domain ontology according to preset corpus data; wherein the modeling comprises ontology concept modeling and ontology relationship modeling;
the knowledge extraction module 32 is configured to perform content wind control knowledge extraction according to the modeled ontology concepts and ontology relationships; the content wind control knowledge extraction comprises entity relation extraction and entity extraction;
the knowledge base construction module 33 is configured to construct a content wind control knowledge base according to the extracted entity relationships and entities.
Further, the modeling module 31 is specifically configured to: acquire ontology terms of the content wind control field according to preset corpus data; calculate word embedding features of the ontology terms, and perform multistage clustering on the word embedding features; and model the ontology concepts of the content wind control field as persons, institutions, events and domain feature vocabularies according to the characteristics of content wind control and the 5W elements of media content.
Further, the modeling module 31 is specifically configured to: when the relation among ontology terms in the content wind control field is a hierarchical relation, extracting by adopting a template preset by an expert and a multi-strategy mode based on language rules and a clustering method; when the relation between the ontology terms in the content wind control field is a non-hierarchical relation, the corpus data is analyzed by adopting a natural language processing technology, the core verbs in each sentence are identified, the terms adjacent to the core verbs are found by combining the context, and the relation between the two terms is constructed.
Further, the knowledge extraction module 32 is specifically configured to: given a sentence, feed the sentence into an encoder to obtain the corresponding word vectors; realize implicit encoding of entity information by modeling the importance of and the correlation between the word vectors, and add an average pooling operation to obtain the entity embedding features of the sentence; and concatenate the entity embedding vector with the word vectors and classify through a neural network to obtain the entity relation representation of the whole sentence.
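The data flow of this relation extraction step can be sketched in plain Python. The sketch is only an assumption-laden illustration: the encoder output H is taken as given, the self-attention uses H directly as queries, keys, and values with no learned projections, the sentence vector used for concatenation is the mean of the word vectors, and a relation is accepted when its sigmoid score is greater than the threshold eps. The weights W and b and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(H):
    """Scaled dot-product self-attention with Q = K = V = H (no learned weights)."""
    d = len(H[0])
    out = []
    for q in H:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in H]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, H)) for j in range(d)])
    return out

def mean_pool(H):
    """Average pooling over the word dimension: n x d -> d."""
    return [sum(col) / len(H) for col in zip(*H)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relation_scores(H, W, b):
    entity_emb = mean_pool(self_attention(H))   # entity embedding of the sentence
    sentence_emb = mean_pool(H)                 # assumed sentence-level vector
    features = entity_emb + sentence_emb        # concatenation, length 2d
    return [sigmoid(sum(w * f for w, f in zip(row, features)) + bias)
            for row, bias in zip(W, b)]

def predict_relations(H, W, b, names, eps=0.5):
    # Assumption: a relation is kept when its score is greater than eps.
    return [n for n, p in zip(names, relation_scores(H, W, b)) if p > eps]

if __name__ == "__main__":
    H = [[1.0, 0.0], [0.0, 1.0]]                      # toy word vectors (n=2, d=2)
    W = [[5.0, 5.0, 5.0, 5.0], [-5.0, -5.0, -5.0, -5.0]]
    b = [0.0, 0.0]
    print(predict_relations(H, W, b, ["founded", "visited"]))
```

Because each relation gets its own sigmoid score, a sentence can be assigned several relations at once, which is why a per-relation threshold is used instead of a single softmax decision.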
Further, the knowledge extraction module 32 is specifically configured to: obtain a continuous representation of the relation prompt information according to the entity relation representation; and fuse the continuous representation with the word vectors, identify entities using a conditional random field, and obtain the output for each word in the entity classification stage.
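The conditional random field decoding at the end of this step can be illustrated with standard Viterbi decoding. This is a generic sketch, not the specific network of the present application: the emission scores are assumed to have been produced already from the fused prompt and word vector representation, and the label set (O, B, I), the transition scores, and the function name viterbi_decode are assumptions for the example.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][y]: score of label y at position t.
    transitions[p][y]: score of moving from label p to label y.
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for em in emissions[1:]:
        new_score, pointers = [], []
        for y in range(n_labels):
            # Best previous label for reaching label y at this position.
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + em[y])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

O, B, I = 0, 1, 2
# Hypothetical transition scores: forbid an I label directly after O.
TRANSITIONS = [[0.0, 0.0, -1e4],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

if __name__ == "__main__":
    emissions = [[0.1, 2.0, 0.0], [0.0, 0.2, 1.5], [2.0, 0.0, 0.5]]
    print(viterbi_decode(emissions, TRANSITIONS))
```

The transition matrix is what distinguishes the conditional random field from per-token classification: it lets the decoder reject label sequences that are locally plausible but structurally invalid.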
Further, the hierarchical relationship includes a relationship between a person and an organization, or a relationship between a person and a person; the relationship between the person and the organization is the establishment/establishment, the wilting and the visit/visit, and the relationship between the person and the person is the relatives, colleagues/upper and lower levels.
Further, the knowledge base construction module 33 is specifically configured to: adopt a Resource Description Framework (RDF) storage system, in which the content wind control knowledge is stored as graph data and a relational database is used as the underlying storage.
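One way to picture graph data stored on a relational bottom layer is a single triples table. The following is a minimal sketch using the SQLite module from the Python standard library; the table layout, the function names create_store, add_triple, and match, and the example identifiers with the ex: prefix are assumptions for illustration, not the storage system of the present application.

```python
import sqlite3

def create_store():
    """Create an in-memory relational store holding RDF-style triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples ("
        " subject TEXT NOT NULL,"
        " predicate TEXT NOT NULL,"
        " object TEXT NOT NULL,"
        " UNIQUE (subject, predicate, object))"
    )
    return conn

def add_triple(conn, subject, predicate, obj):
    # Duplicate triples are ignored, since a graph is a set of edges.
    conn.execute("INSERT OR IGNORE INTO triples VALUES (?, ?, ?)",
                 (subject, predicate, obj))

def match(conn, subject=None, predicate=None, obj=None):
    """Triple-pattern query; None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("subject", subject),
                          ("predicate", predicate),
                          ("object", obj)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT subject, predicate, object FROM triples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY subject, predicate, object"
    return conn.execute(sql, params).fetchall()

if __name__ == "__main__":
    store = create_store()
    add_triple(store, "ex:PersonA", "ex:founded", "ex:OrgB")
    add_triple(store, "ex:PersonA", "ex:colleagueOf", "ex:PersonC")
    print(match(store, subject="ex:PersonA"))
```

Each row is one edge of the knowledge graph, so graph lookups reduce to pattern queries over the three columns; a production system would add indexes on the columns it queries most.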
The content wind control knowledge base construction apparatus in this embodiment is used to execute the content wind control knowledge base construction methods of the first and second embodiments; its working principle and technical effects are similar to theirs and are not repeated here.
Example four
A fourth embodiment of the present application provides a non-volatile computer storage medium in which at least one executable instruction is stored, where the executable instruction can perform the content wind control knowledge base construction method of any of the foregoing method embodiments.
Example five
Fig. 6 shows a schematic structural diagram of an electronic device according to a fifth embodiment of the present application. The specific embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in fig. 6, the electronic device may include: a processor 402, a communication interface (Communications Interface) 404, a memory 406, and a communication bus 408.
The processor 402, the communication interface 404, and the memory 406 communicate with each other via the communication bus 408. The communication interface 404 is used to communicate with network elements of other devices, such as clients or other servers. The processor 402 is used to execute the program 410 and may specifically perform the relevant steps of the method embodiments described above.
In particular, program 410 may include program code including computer-operating instructions.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application. The one or more processors included in the electronic device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 406 is used to store the program 410. The memory 406 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
The program 410 may be specifically configured to cause the processor 402 to perform the content wind control knowledge base construction method of any of the method embodiments described above.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the description above. In addition, embodiments of the present application are not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and that any description of a specific language above is provided to disclose preferred embodiments of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the application, various features of embodiments of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be construed as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the present application and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components according to embodiments of the present application may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present application may also be embodied as an apparatus or device program (e.g., computer program and computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (8)

1. A method for constructing a content wind control knowledge base, characterized by comprising the following steps:
modeling the content wind control domain ontology according to preset corpus data; wherein the modeling comprises ontology concept modeling and ontology relationship modeling;
content wind control knowledge extraction is carried out according to the modeled ontology concepts and ontology relations; the content wind control knowledge extraction comprises entity relation extraction and entity extraction;
constructing a content wind control knowledge base according to the extracted entity relation and the entity;
wherein the entity relationship extraction includes:
an entity information coding layer based on an attention mechanism is designed after the hidden layer; implicit encoding of entity information is realized by modeling the importance of and the correlation between word vectors, and an average pooling operation is added to obtain the entity embedding of the whole sentence, wherein AttNet denotes using a self-attention mechanism to obtain the embedding information between hidden-layer vectors; wherein, given a sentence S_n, S_n is first fed into the RoBERTa encoder to obtain the corresponding original sentence embedding, n being the number of words of the sentence and d being the vector dimension;
the generated entity embedding is concatenated with the sentence embedding and classified by a neural network to obtain the relation representation of the whole sentence, wherein σ is the sigmoid activation function; a threshold ε is set, and when the threshold condition is satisfied, the sentence is determined to contain the relation r_i;
wherein the entity extraction includes:
the relation obtained in the entity relation extraction stage is converted into a one-hot vector and re-parameterized using a multi-layer perceptron to obtain a continuous representation P_r of the relation prompt information;
P_r is fused with the word vector representation through a Transformer network module: P_r is concatenated with the K and V vectors in the network to compute the changed attention K_p and V_p, and P_r and the attention weight matrix are updated synchronously during training;
the entities are identified using a conditional random field to obtain the output Y_n of each word in the entity classification stage, i.e., the probability that the token contained in the current input is an entity of a certain class.
2. The method of claim 1, wherein the ontology concept modeling comprises:
acquiring ontology terms of the content wind control field according to preset corpus data;
calculating word embedding characteristics of the ontology term in the content wind control field, and performing multistage clustering on the word embedding characteristics;
and modeling the ontology concept of the content wind control field into characters, institutions, events and field feature word tables according to the characteristics of the content wind control and 5W elements of the media content.
3. The method of claim 1, wherein the ontology relationship modeling comprises:
when the relation among ontology terms in the content wind control field is a hierarchical relation, extracting by adopting a template preset by an expert and a multi-strategy mode based on language rules and a clustering method;
when the relation between the ontology terms in the content wind control field is a non-hierarchical relation, the corpus data is analyzed by adopting a natural language processing technology, the core verbs in each sentence are identified, the terms adjacent to the core verbs are found by combining the context, and the relation between the two terms is constructed.
4. The method of claim 3, wherein the hierarchical relationship comprises a relationship of people to institutions, or a relationship between people to people; the relationship between the person and the organization is the establishment/establishment, the wilting and the visit/visit, and the relationship between the person and the person is the relatives, colleagues/upper and lower levels.
5. The method according to any one of claims 1-4, wherein constructing a content wind control knowledge base according to the extracted entity relations and entities comprises:
adopting a Resource Description Framework (RDF) storage system, in which the content wind control knowledge is stored as graph data and a relational database is used as the underlying storage.
6. A content wind control knowledge base construction apparatus, comprising:
the modeling module is used for modeling the ontology of the content wind control domain according to preset corpus data; wherein the modeling comprises ontology concept modeling and ontology relationship modeling;
the knowledge extraction module is used for carrying out content wind control knowledge extraction according to the modeled ontology concepts and ontology relations; the content wind control knowledge extraction comprises entity relation extraction and entity extraction;
the knowledge base construction module is used for constructing a content wind control knowledge base according to the extracted entity relation and the entity;
the knowledge extraction module is specifically configured to: design an entity information coding layer based on an attention mechanism after the hidden layer, realize implicit encoding of entity information by modeling the importance of and the correlation between word vectors, and add an average pooling operation to obtain the entity embedding of the whole sentence, wherein AttNet denotes using a self-attention mechanism to obtain the embedding information between hidden-layer vectors; wherein, given a sentence S_n, S_n is first fed into the RoBERTa encoder to obtain the corresponding original sentence embedding, n being the number of words of the sentence and d being the vector dimension; concatenate the generated entity embedding with the sentence embedding and classify by a neural network to obtain the relation representation of the whole sentence, wherein σ is the sigmoid activation function, a threshold ε is set, and when the threshold condition is satisfied, the sentence is determined to contain the relation r_i; and,
the relation obtained in the entity relation extraction stage is converted into a one-hot vector and re-parameterized using a multi-layer perceptron to obtain a continuous representation P_r of the relation prompt information;
P_r is fused with the word vector representation through a Transformer network module: P_r is concatenated with the K and V vectors in the network to compute the changed attention K_p and V_p, and P_r and the attention weight matrix are updated synchronously during training;
the entities are identified using a conditional random field to obtain the output Y_n of each word in the entity classification stage, i.e., the probability that the token contained in the current input is an entity of a certain class.
7. An electronic device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform the operations corresponding to the content wind control knowledge base construction method according to any one of claims 1-5.
8. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the content wind control knowledge base construction method of any one of claims 1-5.
CN202310094574.9A 2023-01-17 2023-01-17 Content wind control knowledge base construction method, device, equipment and storage medium Active CN116069948B (en)


Publications (2)

Publication Number Publication Date
CN116069948A CN116069948A (en) 2023-05-05
CN116069948B true CN116069948B (en) 2024-01-09

Family

ID=86179869


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968700A (en) * 2019-11-01 2020-04-07 数地科技(北京)有限公司 Domain event map construction method and device fusing multi-class affairs and entity knowledge
CN111832307A (en) * 2020-07-09 2020-10-27 北京工业大学 Entity relationship extraction method and system based on knowledge enhancement
CN111930856A (en) * 2020-07-06 2020-11-13 北京邮电大学 Method, device and system for constructing domain knowledge graph ontology and data
CN112559766A (en) * 2020-12-08 2021-03-26 杭州互仲网络科技有限公司 Legal knowledge map construction system
CN114064918A (en) * 2021-11-06 2022-02-18 中国电子科技集团公司第五十四研究所 Multi-modal event knowledge graph construction method
CN114661856A (en) * 2020-12-23 2022-06-24 沈阳新松机器人自动化股份有限公司 Fusion map construction method
CN114780745A (en) * 2022-04-20 2022-07-22 北京明略昭辉科技有限公司 Method and device for constructing knowledge system, electronic equipment and storage medium
CN115292506A (en) * 2022-06-24 2022-11-04 北京百度网讯科技有限公司 Knowledge graph ontology construction method and device applied to office field
CN115309915A (en) * 2022-09-29 2022-11-08 北京如炬科技有限公司 Knowledge graph construction method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180232443A1 (en) * 2017-02-16 2018-08-16 Globality, Inc. Intelligent matching system with ontology-aided relation extraction




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant