WO2020253350A1 - Review method, apparatus, computer device, and storage medium for network content publishing

Publication number
WO2020253350A1
Authority
WO
WIPO (PCT)
Prior art keywords
basic
content
preset
word segmentation
published
Prior art date
Application number
PCT/CN2020/085582
Other languages
English (en)
French (fr)
Inventor
夏新
Original Assignee
深圳壹账通智能科技有限公司
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020253350A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the field of natural language processing, and in particular to a review method, device, computer equipment, and storage medium for publishing network content.
  • Keyword detection is currently the main method used for such review.
  • The inventor realized that this review method can only match against preset keywords to judge whether the published content complies with the rules, so it is limited by how the keywords are set. Moreover, users can easily avoid the keywords when publishing bad content, which makes the review of online published content low in intelligence and accuracy.
  • The embodiments of the present application provide a method, device, computer equipment, and storage medium for reviewing network content publishing, to solve the problem that reviewing network content publishing by keyword matching results in low intelligence and low accuracy.
  • A review method for network content publishing, including:
  • matching the current user information with each user information in the preset list type database to determine the user type corresponding to the current user information, wherein the list type database includes each user information and the user type corresponding to that user information;
  • comparing the comprehensive score with a preset score threshold, and if the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing the content to be published, and sending an approval message to the client.
  • A review device for network content publishing, including:
  • a request receiving module, configured to obtain the current user information and the content to be published contained in the review request if a review request for network content publishing sent by the client is received;
  • a type matching module, configured to match the current user information with each user information in the preset list type database to determine the user type corresponding to the current user information, wherein the list type database includes each user information and the user type corresponding to that user information;
  • a content analysis module, configured to, if the user type corresponding to the current user information is an ordinary user, analyze the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published;
  • a semantic recognition module, configured to perform semantic recognition on each of the basic sentences by natural language semantic recognition and obtain the semantic score corresponding to each of the basic sentences;
  • a comprehensive scoring module, configured to determine the comprehensive score of the content to be published according to the semantic score of each basic sentence;
  • a result determination module, configured to compare the comprehensive score with a preset score threshold and, if the comprehensive score is greater than the preset score threshold, confirm that the content to be published is legal, publish the content to be published, and send an approval message to the client.
  • A computer device includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When the processor executes the computer program, a method for reviewing network content publishing is implemented, including: when receiving a review request for network content publishing sent by a client, obtaining the current user information and the content to be published contained in the review request; matching the current user information with each user information in the preset list type database to determine the user type corresponding to the current user information; if the user type corresponding to the current user information is an ordinary user, analyzing the content to be published according to the preset sentence division method to obtain each basic sentence contained in the content to be published; performing semantic recognition on each basic sentence by natural language semantic recognition to obtain the semantic score corresponding to each basic sentence; determining the comprehensive score of the content to be published according to the semantic score of each basic sentence; and comparing the comprehensive score with the preset score threshold and, when the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing the content to be published, and sending an approval message to the client.
  • A computer-readable storage medium stores a computer program. When the computer program is executed by a processor, a method for reviewing network content publishing is implemented, including: when receiving a review request for network content publishing sent by a client, obtaining the current user information and the content to be published contained in the review request; matching the current user information with each user information in the preset list type database to determine the user type corresponding to the current user information; if the user type corresponding to the current user information is an ordinary user, analyzing the content to be published according to the preset sentence division method to obtain each basic sentence contained in the content to be published; performing semantic recognition on each basic sentence by natural language semantic recognition to obtain the semantic score corresponding to each basic sentence; determining the comprehensive score of the content to be published according to the semantic score of each basic sentence; and comparing the comprehensive score with the preset score threshold and, when the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing the content to be published, and sending an approval message to the client.
  • The method, device, computer equipment, and storage medium for reviewing network content publishing provided by the embodiments of this application perform intelligent semantic recognition of network content and review whether the network content publishing is reasonable according to the recognized semantics, thereby improving the intelligence and accuracy of the review of network content publishing.
  • FIG. 1 is a schematic diagram of an application environment of a method for reviewing network content publishing provided by an embodiment of the present application;
  • FIG. 2 is an implementation flowchart of a method for reviewing network content publishing provided by an embodiment of the present application;
  • FIG. 3 is a flowchart of reviewing non-ordinary users in the method for reviewing network content publishing provided by an embodiment of the present application;
  • FIG. 4 is a flowchart of the implementation of step S40 in the method for reviewing network content publishing provided by an embodiment of the present application;
  • FIG. 5 is a flowchart of the implementation of step S41 in the method for reviewing network content publishing provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a review device for network content publishing provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 shows an application environment of a method for reviewing network content publishing provided by an embodiment of the present application.
  • the method for reviewing network content publishing is applied in review scenarios of network content publishing in network forums, network live broadcasts or other types of network communities.
  • The review scenario includes a client, a server, and a management terminal. The server is connected to the client and to the management terminal through the network. The client sends a review request for network content publishing to the server; after obtaining the review request, the server determines the user type and selects the review method according to the user type.
  • When the user type is an ordinary user, the server obtains the content to be published and performs semantic analysis to obtain its semantic score, and then determines the legality of the content to be published; when the content is illegal, the corresponding prompt information is sent to the management terminal.
  • the client and the management terminal may specifically be, but are not limited to, smart terminal devices such as mobile phones, tablet computers, and personal computers (Personal Computer, PC), and the server may specifically be implemented by an independent server or a server cluster composed of multiple servers.
  • FIG. 2 shows a method for reviewing network content publishing provided by an embodiment of the present application. The method is applied to the server in FIG. 1 as an example for description. The details are as follows:
  • When the user communicates on the forum through the client, the user first edits the content to be published. After the user clicks the submit button of the client, the client sends a review request containing the user information and the content to be published to the server, and the server receives the user information and the content to be published contained in the review request through the network transmission protocol.
  • The user information includes but is not limited to user account information.
  • The server determines the user type through the user account information.
  • The review method corresponding to the user type is then used to review the content to be published, in order to improve the review efficiency of network content publishing.
  • The content to be published is text information, link information, image information, or video information that the user edits on the client for uploading to forums or other online communities and for interacting with other online users.
  • The network transmission protocol includes but is not limited to: Internet Control Message Protocol (ICMP), Address Resolution Protocol (ARP), File Transfer Protocol (FTP), etc.
  • S20 Match the current user information with each user information in the preset list type database to determine the user type corresponding to the current user information, where the list type database includes each user information and the user type corresponding to the user information.
  • the server stores a preset list type database.
  • the preset list type database contains the user information of all registered users and the user type corresponding to each user information.
  • The preset list type database is searched by traversal query to judge the user type of the user information obtained in step S10, which gives the user type corresponding to the user information.
  • the user types contained in the preset list type database may include: whitelisted users, blacklisted users, and ordinary user types.
  • The distinction between user types is based on the user's credit rating. For example, management personnel have a relatively high credit rating and are generally classified as whitelisted users. Users who have repeatedly been suspected of illegal operations or of disturbing the normal order of the online community have a low credit rating; when the credit rating drops to a certain level, they are classified as blacklisted users.
  • For user information whose user type is ordinary user, the corresponding review request needs further intelligent evaluation, and the review result is determined according to the evaluation result.
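  • The user type matching of step S20 can be sketched as follows. This is a minimal illustration only: the `UserType` names, the sample accounts, and the in-memory table standing in for the preset list type database are all assumptions, and a dictionary lookup is used here in place of the traversal query the text describes.

```python
from enum import Enum


class UserType(Enum):
    WHITELIST = "whitelist"
    BLACKLIST = "blacklist"
    ORDINARY = "ordinary"


# Hypothetical stand-in for the preset list type database:
# maps a user account to its recorded user type.
LIST_TYPE_DB = {
    "admin_01": UserType.WHITELIST,
    "spammer_99": UserType.BLACKLIST,
    "alice": UserType.ORDINARY,
}


def match_user_type(account: str) -> UserType:
    """Return the user type recorded for the account.

    Accounts missing from the table are treated as ordinary users,
    which is an assumption of this sketch.
    """
    return LIST_TYPE_DB.get(account, UserType.ORDINARY)
```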
  • the content to be published is analyzed according to a preset sentence division method to obtain each basic sentence contained in the content to be published.
  • The preset sentence division method may use regular-expression matching of preset delimiters: each position where a preset delimiter is matched is used as a split point, and the content to be published is segmented there to obtain each basic sentence contained in the content to be published.
  • the preset separators include but are not limited to: paragraph characters, line breaks, punctuation marks, etc., which can be specifically set according to actual needs and are not limited here.
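  • A minimal sketch of delimiter-based sentence division, assuming a delimiter set of line breaks plus common Chinese and English sentence punctuation (the actual preset delimiters are left open by the text, and the function name is hypothetical):

```python
import re

# Assumed delimiter set: line breaks plus common Chinese and English
# sentence punctuation. Adjust to actual needs.
DELIMITERS = re.compile(r"[\n\r。！？!?;；.]+")


def split_basic_sentences(content: str) -> list[str]:
    """Split content at every delimiter match and drop empty pieces."""
    pieces = DELIMITERS.split(content)
    return [p.strip() for p in pieces if p.strip()]
```

  • Treating a run of adjacent delimiters as a single split point avoids producing empty basic sentences.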
  • each of the basic sentences is semantically recognized, and the semantics corresponding to each basic sentence is scored according to preset scoring conditions to obtain the semantic score of each basic sentence.
  • Natural language semantic recognition (Natural Language Processing, NLP) is a field of artificial intelligence (AI) whose tasks include text to speech/speech synthesis, speech recognition, Chinese word segmentation, part-of-speech tagging, syntax analysis (parsing), text classification (text categorization), information retrieval, automatic summarization, and text proofing.
  • S50 Determine the comprehensive score of the content to be published according to the semantic score of each basic sentence.
  • the semantic score of each basic sentence is weighted and summarized through a preset weighting method to obtain a comprehensive score of the content to be published.
  • the preset weighting method can be set according to actual needs, for example, different weighting coefficients are set for semantic scores in different ranges.
  • S60 Compare the comprehensive score with the preset score threshold. If the comprehensive score is greater than the preset score threshold, confirm that the content to be published is legal, publish the content to be published, and send an approved message to the client.
  • The server presets a score threshold and compares the comprehensive score with it. When the comprehensive score is greater than the preset score threshold, the server confirms that the content to be published is legal, publishes the content to be published, and sends an approval message to the client.
  • When the comprehensive score is less than or equal to the preset score threshold, the content to be published is considered to be suspected of violating regulations: the content to be published is rejected, the client is notified that the review has not passed, and the review request for the content to be published is recorded for subsequent handling by management personnel.
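  • The comparison in step S60 can be sketched as below. The function name and the returned labels are hypothetical; the point of the sketch is the boundary case, where a score exactly equal to the threshold does not pass.

```python
def decide(comprehensive_score: float, threshold: float) -> str:
    """Approve only when the score is strictly above the threshold."""
    if comprehensive_score > threshold:
        return "approved"
    # Scores less than or equal to the threshold are rejected.
    return "rejected"
```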
  • The current user information and the content to be published contained in the review request are obtained, and the current user information is matched with each user information in the preset list type database.
  • When the comprehensive score is greater than the preset score threshold, the content to be published is confirmed to be legal, the content to be published is published, and an approval message is sent to the client. This realizes intelligent semantic recognition of network content and reviews whether the network content publishing is reasonable based on the recognized semantics, which improves the intelligence and accuracy of reviewing network content publishing.
  • the method for reviewing network content publishing further includes:
  • If the user type corresponding to the current user information is a whitelisted user, the content to be published is published directly.
  • If, by traversal query of the preset list type database, the user type corresponding to the current user information is determined to be a blacklisted user, no semantic review of the content to be published is needed; the content to be published is deleted directly, and a review-failed message is sent to the client.
  • step S70 and step S80 are not necessarily executed sequentially, and they can be executed in parallel, which is not limited here.
  • For whitelisted users and blacklisted users, quick review operations are performed in a preset manner without semantic recognition of the content to be published, which improves the review efficiency of network content publishing.
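  • The routing by user type can be sketched as follows, assuming the semantic review is supplied as a callable that returns whether the content passes. The user type strings and result labels are illustrative names, not identifiers from the source.

```python
from typing import Callable


def review(user_type: str, content: str,
           semantic_review: Callable[[str], bool]) -> str:
    """Route a review request by user type."""
    if user_type == "whitelist":
        return "published"  # published directly, no semantic review
    if user_type == "blacklist":
        return "deleted"    # deleted directly, client is told the review failed
    # Ordinary users go through the semantic review.
    return "published" if semantic_review(content) else "rejected"
```

  • Only the ordinary-user branch calls `semantic_review`, which is where the efficiency gain described above comes from.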
  • The following uses a specific embodiment to describe in detail how step S40 performs semantic recognition on each basic sentence by natural language semantic recognition to obtain the semantic score corresponding to each basic sentence.
  • FIG. 4 shows a specific implementation process of step S40 provided in an embodiment of the present application, which is detailed as follows:
  • S41 Perform word segmentation processing on the basic sentence through a preset word segmentation method to obtain the basic word segmentation contained in the basic sentence.
  • each basic sentence obtained in step S30 is subjected to word segmentation processing to obtain the basic word segmentation contained in each basic sentence.
  • the preset word segmentation methods include but are not limited to: third-party word segmentation tools or word segmentation algorithms, etc.
  • common third-party word segmentation tools include, but are not limited to: Stanford NLP word segmentation, ICTCLAS word segmentation system, ansj word segmentation tool and HanLP Chinese word segmentation tool, etc.
  • Word segmentation algorithms include but are not limited to: the Maximum Matching (MM) algorithm, the Reverse Maximum Matching (RMM) algorithm, the Bi-directional Matching (BM) algorithm, the Hidden Markov Model (HMM), and the N-gram model.
  • the basic word segmentation is extracted by word segmentation. On the one hand, it can filter out some meaningless words in the effective basic sentence. On the other hand, it is also beneficial to use these basic word segmentation to generate word vectors.
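  • As an illustration of one of the listed algorithms, the following is a minimal sketch of forward Maximum Matching (MM). The vocabulary, the `max_len` default, and the single-character fallback are assumptions of this sketch, and it is not presented as the segmentation method the embodiment actually uses.

```python
def forward_max_match(sentence: str, vocabulary: set[str],
                      max_len: int = 4) -> list[str]:
    """Greedy left-to-right segmentation using the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

  • Because the match is greedy, `forward_max_match("abcd", {"ab", "abc", "cd"})` picks `"abc"` first and is left with `"d"`, even though `"ab"` + `"cd"` is also a valid split. This is the kind of ambiguity that the probability-based selection in steps S412 to S414 is meant to resolve.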
  • S42 Convert the basic word segmentation into a word vector, and cluster the word vector through a preset clustering algorithm to obtain the cluster center corresponding to each basic sentence.
  • language representation mainly refers to the formal or mathematical description of language, so that language can be expressed in a computer and can be processed automatically by computer programs.
  • the word vector referred to in the embodiment of this application is to express a basic word segmentation in the form of a vector.
  • Each basic word segment is converted to obtain its corresponding word vector. The word vectors are then clustered through a preset clustering algorithm to obtain the cluster center corresponding to the word vector of each basic word segment, and the cluster centers corresponding to the basic word segments in the same basic sentence are clustered further to obtain the cluster center corresponding to the basic sentence.
  • clustering algorithm is also called cluster analysis. It is a statistical analysis method for the classification of samples or indicators. It is also an important algorithm for data mining.
  • Clustering algorithms include but are not limited to: the K-Means clustering algorithm, the mean shift clustering algorithm, density-based clustering (Density-Based Spatial Clustering of Applications with Noise, DBSCAN), expectation-maximization clustering based on Gaussian mixture models, agglomerative hierarchical clustering, and the graph community detection (Graph Community Detection) algorithm.
  • The K-Means clustering algorithm is adopted to cluster the word vectors corresponding to each basic word segment and determine the classification of each basic word segment; the basic sentences are then clustered to obtain the cluster center corresponding to each basic sentence.
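  • A minimal pure-Python sketch of K-Means over word vectors. The initialization (first k points), the fixed iteration count, and the toy two-dimensional vectors are assumptions of this sketch; it requires k to be no larger than the number of points.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, k, iterations=10):
    """Plain K-Means using the first k points as initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (Euclidean distance).
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster is empty.
        centers = [mean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers
```

  • With `k=1` the procedure reduces to the mean of all vectors, which is one simple way to collapse the word-level centers of a basic sentence into a single sentence-level center.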
  • the server pre-stores preset semantic vectors representing designated semantics, and each preset semantic vector corresponds to a preset semantic score.
  • The distances between the cluster center corresponding to the basic sentence and each of these preset semantic vectors are calculated separately. The preset semantic vector with the minimum distance is used as the target vector, and the semantic score corresponding to the target vector is used as the semantic score of the basic sentence.
  • the scoring parameter can be calculated according to the distance between the basic sentence and the target vector, and the semantic score of the basic sentence can be determined according to the scoring parameter and the semantic score corresponding to the target vector.
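  • The nearest-vector scoring can be sketched as below, assuming Euclidean distance (the text does not name a distance measure) and two made-up preset semantic vectors with made-up scores.

```python
import math

# Hypothetical preset semantic vectors, each paired with a preset score.
PRESET_SEMANTICS = [
    ([1.0, 0.0], 5.0),   # e.g. a "polite" semantic
    ([0.0, 1.0], -3.0),  # e.g. an "abusive" semantic
]


def semantic_score(cluster_center):
    """Score a sentence by its closest preset semantic vector."""
    _, score = min(
        PRESET_SEMANTICS,
        key=lambda item: math.dist(cluster_center, item[0]),
    )
    return score
```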
  • The basic sentence is segmented through a preset word segmentation method to obtain the basic word segmentation contained in the basic sentence. The basic word segmentation is then converted into word vectors, and the word vectors are clustered through a preset clustering algorithm to obtain the cluster center corresponding to each basic sentence.
  • The semantic score corresponding to the target vector is used as the semantic score corresponding to the basic sentence, which realizes semantic scoring of the basic sentence and improves the intelligence and efficiency of the review.
  • The following uses a specific embodiment to describe in detail how step S41 performs word segmentation processing on the basic sentence through the preset word segmentation method to obtain the basic word segmentation contained in the basic sentence.
  • FIG. 5 shows a specific implementation process of step S41 provided by an embodiment of the present application, which is detailed as follows:
  • S411 Obtain a preset training corpus, and use the N-gram model to analyze the preset training corpus to obtain word sequence data of the preset training corpus.
  • The training corpus is a corpus obtained by training with related corpora and is used to evaluate basic sentences in natural language. By using the N-gram model to perform statistical analysis on each corpus item in the preset training corpus, the number of times a corpus item H appears after another corpus item I in the preset training corpus is obtained, which gives the word sequence data of the word sequence composed of "corpus item I + corpus item H".
  • the content of the training corpus in the embodiment of the present application includes, but is not limited to: professional information corresponding to topics of forums or online communities, online corpus, general corpus, etc.
  • Corpus refers to a large-scale electronic text library that has been scientifically sampled and processed.
  • Corpus is the basic resource for linguistic research and the main resource for empirical language research methods. It is used in lexicography, language teaching, traditional language research, and statistical or case-based research in natural language processing.
  • Corpus material, that is, language material, is the content of linguistic research and the basic unit of a corpus.
  • The preset training corpus is a corpus in the field of "current affairs", obtained by crawling popular web topics and current affairs news with a web crawler.
  • A word sequence is a sequence formed by combining at least two corpus items in a certain order.
  • The word sequence frequency is the ratio of the number of occurrences of the word sequence to the number of occurrences of all word segments in the entire corpus.
  • A word segment here is a word sequence obtained by combining consecutive word sequences according to a preset combination method. For example, if a certain word sequence "I love tomatoes" occurs 100 times in the entire corpus, and the total number of occurrences of all word segments in the entire corpus is 100000, then the word sequence frequency of "I love tomatoes" is 100 / 100000 = 0.001.
  • the N-gram model is a commonly used language model in the semantic recognition of large vocabulary continuous text.
  • the sentence with the greatest probability can be calculated, so as to realize the automatic conversion to Chinese characters without the user's manual selection, which improves the accuracy of word sequence determination.
  • The process of obtaining a preset training corpus and analyzing it with the N-gram model to obtain its word sequence data can be carried out before the review, and the obtained word sequence data can be stored so that it can be called directly when needed.
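  • The word sequence data of step S411 can be sketched for the bigram case (N = 2) as simple counts of single words and adjacent word pairs. The sketch assumes the training corpus is already tokenized into lists of words; the function name is hypothetical.

```python
from collections import Counter


def bigram_counts(corpus_sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus_sentences:
        unigrams.update(tokens)
        # zip pairs each word with the word that follows it.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams
```

  • These counts can be computed once, stored, and reused for every later review, as the text describes.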
  • S412 Perform word segmentation analysis on the basic sentence to obtain M word segmentation sequences.
  • Each basic sentence can be segmented in different ways, and the sentence may be understood differently depending on the segmentation.
  • The server obtains the composition of the M word segmentation sequences of the basic sentence.
  • M is the total number of all possible word segmentation sequences.
  • Each word segmentation sequence is the result of one way of dividing a basic sentence, and each resulting sequence contains at least two word segments.
  • For example, if a basic sentence is "Today is really hot", analyzing it yields word segmentation sequence A: "today", "true", "hot", and word segmentation sequence B: "Jin", "Innocent", "Hot", and so on. The English glosses reflect two different groupings of the characters of the original Chinese sentence.
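  • One way to enumerate all candidate word segmentation sequences of step S412 is a recursive dictionary split, sketched below. The vocabulary is an assumption; the text does not say how the M sequences are generated.

```python
def all_segmentations(sentence, vocabulary):
    """Return every way to split the sentence into vocabulary words."""
    if not sentence:
        return [[]]
    results = []
    for end in range(1, len(sentence) + 1):
        head = sentence[:end]
        if head in vocabulary:
            # Extend each segmentation of the remainder with this head word.
            for rest in all_segmentations(sentence[end:], vocabulary):
                results.append([head] + rest)
    return results
```

  • The length of the returned list plays the role of M. The enumeration is exponential in the worst case, so it is only suitable for short basic sentences.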
  • S413 For each word segmentation sequence, calculate the occurrence probability of each word segmentation sequence according to the word sequence data of the preset training corpus to obtain the occurrence probability of M word segmentation sequences.
  • the occurrence probability of each word segmentation sequence is calculated to obtain the occurrence probability of M word segmentation sequences.
  • the Markov hypothesis theory can be used to calculate the occurrence probability of the word segmentation sequence: the appearance of the Y-th word is only related to the previous Y-1 words, and is not related to any other words.
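  • The probability of the entire sentence is the product of the occurrence probabilities of its word segments:
    $$P(T) = P(W_1 W_2 \ldots W_Y) = P(W_1)\, P(W_2 \mid W_1) \cdots P(W_Y \mid W_1 W_2 \ldots W_{Y-1})$$
  • where $P(T)$ is the occurrence probability of the entire sentence, and $P(W_Y \mid W_1 W_2 \ldots W_{Y-1})$ is the probability that the Y-th word segment appears after the word sequence composed of the preceding Y-1 word segments.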
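  • A minimal sketch of the occurrence-probability calculation, assuming a bigram (N = 2) approximation in which each word segment depends only on the one before it, with probabilities estimated from raw counts and no smoothing. The function name and the count dictionaries are hypothetical.

```python
def sequence_probability(tokens, unigrams, bigrams, total_words):
    """Bigram approximation of P(T) for one word segmentation sequence.

    P(W_1) is estimated as count(W_1) / total_words and each later factor
    as count(W_{k-1}, W_k) / count(W_{k-1}). Without smoothing, an unseen
    word or pair gives probability 0.
    """
    if not tokens or unigrams.get(tokens[0], 0) == 0:
        return 0.0
    probability = unigrams[tokens[0]] / total_words
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams.get(prev, 0) == 0:
            return 0.0
        probability *= bigrams.get((prev, cur), 0) / unigrams[prev]
    return probability
```

  • In practice some smoothing is usually applied so that a single unseen pair does not force the whole sequence to zero; it is left out here to keep the sketch short.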
  • S414 From the occurrence probabilities of M word segmentation sequences, select the word segmentation sequence corresponding to the occurrence probability that reaches the preset probability threshold as the target word segmentation sequence, and use each word segmentation in the target word segmentation sequence as the basic word segmentation contained in the basic sentence .
  • For each word segmentation sequence, an occurrence probability is obtained through the calculation in step S413, giving the occurrence probabilities of M word segmentation sequences in total.
  • The occurrence probabilities of the M word segmentation sequences are each compared with the preset probability threshold. Occurrence probabilities greater than or equal to the preset probability threshold are selected as effective occurrence probabilities, and the word segmentation sequences corresponding to the effective occurrence probabilities are used as the target word segmentation sequences.
  • the word segmentation sequence whose occurrence probability does not meet the requirements is filtered out, so that the selected target word segmentation sequence is closer to the meaning expressed in natural language, and the accuracy of semantic recognition is improved.
  • the content to be published is determined to be content that does not conform to the specification; the review failure is then taken as the review result, and the reminder message "Please abide by the rules of online speech and be a civilized netizen" is sent to the client.
  • The preset number is 5. After the effective occurrence probabilities are sorted, the first 5 effective occurrence probabilities in the ranking are selected, and the word segmentation sequences corresponding to these 5 occurrence probabilities are used as the target word segmentation sequences.
  • the word segmentation sequence corresponding to the maximum occurrence probability is selected as the target word segmentation sequence, so as to reduce the amount of subsequent calculations and improve the review efficiency of network content publishing.
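  • The selection of target word segmentation sequences in step S414 can be sketched as below. The function name and the `(tokens, probability)` pair layout are assumptions; `top_n=1` corresponds to keeping only the sequence with the maximum occurrence probability.

```python
def select_targets(scored_sequences, threshold, top_n=None):
    """Keep sequences whose probability reaches the threshold, best first.

    scored_sequences is a list of (tokens, probability) pairs. When top_n
    is given, only the top_n most probable survivors are returned.
    """
    valid = [item for item in scored_sequences if item[1] >= threshold]
    valid.sort(key=lambda item: item[1], reverse=True)
    return valid if top_n is None else valid[:top_n]
```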
  • Because the word sequence data of the preset training corpus is obtained in advance, it can be used directly when calculating occurrence probabilities, which saves calculation time and helps improve review efficiency.
  • The basic sentence is analyzed by word segmentation to obtain M word segmentation sequences, and then, for each word segmentation sequence, the occurrence probability is calculated based on the word sequence data of the preset training corpus, giving the occurrence probabilities of the M word segmentation sequences.
  • Each word segment in the target word segmentation sequence is used as a basic word segment contained in the basic sentence, which ensures the accuracy of word segmentation and helps improve the accuracy of subsequent clustering and semantic evaluation through the basic word segments.
  • In step S50, the specific process of determining the comprehensive score of the content to be published according to the semantic score of each basic sentence is detailed as follows:
  • M_i is the semantic score of the i-th basic sentence
  • a and b are preset parameters
  • S_i is the weighted score of the i-th basic sentence
  • W is the comprehensive score of the content to be published
  • i and n are positive integers, and i ≤ n.
  • The semantic score can be used to express how well the semantics conform to the specification.
  • A semantic score less than 0 indicates that the semantics of the basic sentence do not conform to the specification.
  • The preset parameter a is set to a larger value than the preset parameter b, so that non-conforming basic sentences have a greater impact on the content to be published as a whole.
  • The values of the preset parameters a and b can be selected according to the actual situation, and there is no specific limitation here.
  • The semantic scores in different ranges are weighted and aggregated by the preset formula to obtain the comprehensive score of the content to be published, which helps make the comprehensive score a more reasonable evaluation.
  • Fig. 6 shows a schematic block diagram of a device for reviewing network content publishing in one-to-one correspondence with the method for reviewing network content publishing in the foregoing embodiment.
  • the review device for publishing network content includes a request receiving module 10, a type matching module 20, a content analysis module 30, a semantic recognition module 40, a comprehensive scoring module 50 and a result determination module 60.
  • the detailed description of each functional module is as follows:
  • the request receiving module 10 is configured to obtain the current user information and the content to be published included in the review request if a review request for network content publishing sent by the client is received;
  • the type matching module 20 is used to match the current user information with each user information in the preset list type database to determine the user type corresponding to the current user information, where the list type database includes each user information and the user type corresponding to that user information;
  • the content parsing module 30 is configured to, if the user type corresponding to the current user information is an ordinary user, analyze the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published;
  • the semantic recognition module 40 is used to perform semantic recognition on each basic sentence by using natural language semantic recognition, and obtain the semantic score corresponding to each basic sentence;
  • the comprehensive scoring module 50 is used to determine the comprehensive score of the content to be published according to the semantic score of each basic sentence;
  • the result determination module 60 is used to compare the comprehensive score with a preset score threshold. If the comprehensive score is greater than the preset score threshold, confirm that the content to be published is legal, publish the content to be published, and send a message of approval to the client.
  • the review device for publishing network content further includes:
  • the first review module 70 is configured to publish content to be published if the user type corresponding to the current user information is a whitelist user;
  • the second review module 80 is configured to remove the content to be published if the user type corresponding to the current user information is a blacklisted user, and send a message that the review failed to the client.
  • semantic recognition module 40 includes:
  • the word segmentation unit 41 is used to perform word segmentation processing on the basic sentence through a preset word segmentation method to obtain the basic word segmentation contained in the basic sentence;
  • the clustering unit 42 is configured to convert the basic word segmentation into a word vector, and cluster the word vector through a preset clustering algorithm to obtain the cluster center corresponding to each basic sentence;
  • the scoring unit 43 is used to calculate the distance between the cluster center corresponding to the basic sentence and each preset word meaning vector for each basic sentence, and use the preset word meaning vector corresponding to the minimum distance as the target vector, and set the semantics corresponding to the target vector The score serves as the semantic score corresponding to the basic sentence.
  • word segmentation unit 41 includes:
  • the training subunit 411 is used to obtain a preset training corpus, and use the N-gram model to analyze the preset training corpus to obtain word sequence data of the preset training corpus;
  • the parsing subunit 412 is used to perform word segmentation analysis on the basic sentence to obtain M word segmentation sequences;
  • the calculation subunit 413 is used to calculate the occurrence probability of each word sequence according to the word sequence data of the preset training corpus for each word segmentation sequence to obtain the occurrence probability of M word segmentation sequences;
  • the selection subunit 414 is used to select, from the occurrence probabilities of the M word segmentation sequences, the word segmentation sequence whose occurrence probability reaches the preset probability threshold as the target word segmentation sequence, and to use each word in the target word segmentation sequence as a basic word segment contained in the basic sentence.
  • the comprehensive scoring module 50 includes:
  • the score calculation unit 51 is used to calculate the comprehensive score of the content to be published using the following formula:
  • M_i is the semantic score of the i-th basic sentence
  • a and b are preset parameters
  • S_i is the weighted score of the i-th basic sentence
  • W is the comprehensive score of the content to be published
  • i and n are positive integers, and i ≤ n.
  • Each module in the above-mentioned network content publishing review device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • Fig. 7 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the computer device may be a server, and its internal structure diagram may be as shown in Figure 7.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store the preset corpus and the preset word meaning vector.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, any one or more sets of steps of the above-disclosed method for reviewing network content publishing are realized.
  • In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • When the processor executes the computer program, the steps of the method for reviewing network content publishing in the foregoing embodiment are implemented, for example, steps S10 to S60 shown in FIG. 2.
  • Alternatively, when the processor executes the computer program, the functions of the modules/units of the review device for publishing network content in the foregoing embodiment are implemented, for example, the functions of modules 10 to 60 shown in FIG. 6. To avoid repetition, details are not repeated here.
  • In one embodiment, a computer-readable storage medium is provided. The computer-readable storage medium is a volatile or non-volatile storage medium and stores a computer program. When executed by a processor, the computer program implements the steps of the method for reviewing network content publishing in the foregoing embodiment, or implements the functions of each module/unit of the review device for publishing network content in the foregoing embodiment. To avoid repetition, details are not repeated here.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

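A method, apparatus, computer device, and storage medium for reviewing network content publishing. The method includes: upon receiving a review request for network content publishing, obtaining the current user information and the content to be published contained in the review request, and determining the user type corresponding to the current user information; if the user type is ordinary user, parsing the content to be published to obtain basic sentences; performing semantic recognition on the basic sentences by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence; determining the comprehensive score of the content to be published from the semantic score of each basic sentence; and confirming whether the content to be published is legal according to the comprehensive score and a preset score threshold. This enables intelligent semantic recognition of network content and reviews whether its publication is appropriate based on the recognized semantics, improving both the degree of intelligence and the accuracy of the review.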

Description

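Method, Apparatus, Computer Device, and Storage Medium for Reviewing Network Content Publishing
This application claims priority to Chinese patent application No. 201910522440.6, filed with the China Patent Office on June 17, 2019 and entitled "Method, Apparatus, Computer Device, and Storage Medium for Reviewing Network Content Publishing", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of natural language processing, and in particular to a method, apparatus, computer device, and storage medium for reviewing network content publishing.
Background
With the rapid development of technology and steadily improving quality of life, more and more people use the Internet to interact and learn, and forums of all kinds have become one of the popular ways to communicate online. Today, tens of thousands of forum users post and reply every day, which makes communication increasingly convenient. Inevitably, however, a small number of people, for personal emotional reasons, post and spread vulgar, violent, superstitious, or reactionary remarks on online forums, which hinders normal communication among netizens. It is therefore necessary to review content when forum users post or reply, so as to maintain a positive and healthy environment for discussion.
In the prior art, review relies mainly on keyword detection. The inventor realized that this approach can only match against preset keywords to judge whether published content is compliant. It is limited by the keyword settings, and users can easily avoid the keywords to publish objectionable content, so both the degree of intelligence and the accuracy of the review are low.
Summary
The embodiments of this application provide a method, apparatus, computer device, and storage medium for reviewing network content publishing, to solve the low degree of intelligence and low accuracy that result from reviewing network content publishing by keyword matching.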
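A method for reviewing network content publishing, including:
if a review request for network content publishing sent by a client is received, obtaining the current user information and the content to be published contained in the review request;
matching the current user information against each piece of user information in a preset list type database to determine the user type corresponding to the current user information, where the list type database includes each piece of user information and the user type corresponding to that user information;
if the user type corresponding to the current user information is ordinary user, parsing the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published;
performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence;
determining the comprehensive score of the content to be published according to the semantic score of each basic sentence;
comparing the comprehensive score with a preset score threshold, and if the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing the content to be published, and sending a review-passed message to the client.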
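An apparatus for reviewing network content publishing, including:
a request receiving module, configured to, if a review request for network content publishing sent by a client is received, obtain the current user information and the content to be published contained in the review request;
a type matching module, configured to match the current user information against each piece of user information in a preset list type database to determine the user type corresponding to the current user information, where the list type database includes each piece of user information and the user type corresponding to that user information;
a content parsing module, configured to, if the user type corresponding to the current user information is ordinary user, parse the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published;
a semantic recognition module, configured to perform semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence;
a comprehensive scoring module, configured to determine the comprehensive score of the content to be published according to the semantic score of each basic sentence;
a result determination module, configured to compare the comprehensive score with a preset score threshold, and if the comprehensive score is greater than the preset score threshold, confirm that the content to be published is legal, publish the content to be published, and send a review-passed message to the client.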
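A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements a method for reviewing network content publishing, including: upon receiving a review request for network content publishing sent by a client, obtaining the current user information and the content to be published contained in the review request; matching the current user information against each piece of user information in a preset list type database to determine the user type corresponding to the current user information; if the user type is ordinary user, parsing the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published; performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence; determining the comprehensive score of the content to be published according to the semantic score of each basic sentence; and finally comparing the comprehensive score with a preset score threshold and, when the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing it, and sending a review-passed message to the client.
A computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements a method for reviewing network content publishing, including: upon receiving a review request for network content publishing sent by a client, obtaining the current user information and the content to be published contained in the review request; matching the current user information against each piece of user information in a preset list type database to determine the user type corresponding to the current user information; if the user type is ordinary user, parsing the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published; performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence; determining the comprehensive score of the content to be published according to the semantic score of each basic sentence; and finally comparing the comprehensive score with a preset score threshold and, when the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing it, and sending a review-passed message to the client.
The method, apparatus, computer device, and storage medium for reviewing network content publishing provided by the embodiments of this application enable intelligent semantic recognition of network content and review whether its publication is appropriate based on the recognized semantics, improving both the degree of intelligence and the accuracy of the review.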
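Brief Description of the Drawings
Fig. 1 is a schematic diagram of the application environment of the method for reviewing network content publishing provided by an embodiment of this application;
Fig. 2 is a flowchart of the implementation of the method for reviewing network content publishing provided by an embodiment of this application;
Fig. 3 is a flowchart of the review of non-ordinary users in the method for reviewing network content publishing provided by an embodiment of this application;
Fig. 4 is a flowchart of the implementation of step S40 in the method for reviewing network content publishing provided by an embodiment of this application;
Fig. 5 is a flowchart of the implementation of step S41 in the method for reviewing network content publishing provided by an embodiment of this application;
Fig. 6 is a schematic diagram of the apparatus for reviewing network content publishing provided by an embodiment of this application;
Fig. 7 is a schematic diagram of the computer device provided by an embodiment of this application.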
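Detailed Description
Referring to Fig. 1, Fig. 1 shows the application environment of the method for reviewing network content publishing provided by an embodiment of this application. The method is applied to review scenarios for content published in online forums, live streaming, or other kinds of online communities. The scenario includes a client, a server, and a management terminal; the server is connected to the client and to the management terminal over a network. The client sends a review request for network content publishing to the server. After obtaining the request, the server determines the user type and chooses the review mode accordingly. When the user type is ordinary user, the server obtains the content to be published and performs semantic analysis to obtain its semantic score, and thereby determines whether the content is legal; if it is not, the server sends a corresponding notice to the management terminal. The client and the management terminal may be, but are not limited to, smart terminal devices such as mobile phones, tablets, and personal computers (PCs). The server may be implemented as a standalone server or as a cluster of servers.
Referring to Fig. 2, Fig. 2 shows a method for reviewing network content publishing provided by an embodiment of this application. The method is described as applied to the server in Fig. 1, as follows:
S10: If a review request for network content publishing sent by a client is received, obtain the current user information and the content to be published contained in the review request.
Specifically, when communicating on a forum through the client, the user first edits the content to be published. After the user clicks the submit button, the client sends the server a review request containing the user information and the content to be published, and the server receives them through a network transmission protocol.
The user information includes, but is not limited to, user account information. The server determines the user type from the account information. In this embodiment, content from users of different types is reviewed using the review mode corresponding to that user type, which improves review efficiency.
The content to be published is the text, links, images, video, and similar information that the user has edited on the client for upload to a forum or other online community in order to interact with other users.
The network transmission protocols include, but are not limited to, the Internet Control Message Protocol (ICMP), the Address Resolution Protocol (ARP), and the File Transfer Protocol (FTP).
S20: Match the current user information against each piece of user information in the preset list type database to determine the user type corresponding to the current user information, where the list type database includes each piece of user information and the user type corresponding to that user information.
Specifically, the server stores a preset list type database containing the user information of all registered users and the user type corresponding to each. The database is queried by traversal to determine the user type of the user information obtained in step S10.
The user types in the preset list type database may include whitelist user, blacklist user, and ordinary user. User types are distinguished by the user's credit level. For example, users on the administrator list have a relatively high credit level and are generally classified as whitelist users. Users who are repeatedly suspected of violations that disrupt the normal order of the online community have a lower credit level and, once it falls far enough, are added to the blacklist.
For user information whose user type is ordinary user, the corresponding review request requires further intelligent evaluation, and the review result is determined from the evaluation result.
S30: If the user type corresponding to the current user information is ordinary user, parse the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published.
Specifically, when the user type corresponding to the user information is ordinary user, the content to be published is parsed according to the preset sentence division method to obtain each basic sentence it contains.
In this embodiment, the preset sentence division method may be to match preset delimiters with a regular expression and split the content to be published at each position where a delimiter is found, yielding each basic sentence contained in the content.
The preset delimiters include, but are not limited to, paragraph breaks, line breaks, and punctuation marks. They can be set according to actual needs and are not limited here.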
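The delimiter-based splitting of step S30 can be illustrated with a short Python sketch. The delimiter set below is an assumption chosen for illustration (the source leaves it configurable), and `split_basic_sentences` is a hypothetical helper name, not one taken from the application.

```python
import re

# Assumed delimiter set: line breaks plus common Chinese and Western
# sentence-ending punctuation. A real deployment would configure this.
DELIMITERS = re.compile(r"[\r\n。！？!?；;.]+")


def split_basic_sentences(content: str) -> list[str]:
    """Split content at every run of delimiters and drop empty pieces."""
    return [piece.strip() for piece in DELIMITERS.split(content) if piece.strip()]


sentences = split_basic_sentences("今天真热。你好吗？\nfine")
```

Treating a run of adjacent delimiters as a single split point avoids producing empty basic sentences when, for example, a question mark is followed by a line break.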
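S40: Perform semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence.
Specifically, semantic recognition is performed on each basic sentence by means of natural language semantic recognition, and the semantics of each basic sentence are scored according to preset scoring conditions to obtain the semantic score of each basic sentence.
Natural language semantic recognition (Natural Language Processing, NLP) is a subfield of artificial intelligence (AI) that uses machine learning to understand and parse natural language in order to solve problems in the natural language domain. The main applications of NLP include, but are not limited to: text to speech / speech synthesis, speech recognition, Chinese word segmentation, part-of-speech tagging, parsing, text categorization, information retrieval, automatic summarization, and text proofing.
S50: Determine the comprehensive score of the content to be published according to the semantic score of each basic sentence.
Specifically, the semantic scores of the basic sentences are weighted and aggregated by a preset weighting method to obtain the comprehensive score of the content to be published.
The preset weighting method can be set according to actual needs, for example by assigning different weighting coefficients to semantic scores in different ranges.
S60: Compare the comprehensive score with the preset score threshold; if the comprehensive score is greater than the preset score threshold, confirm that the content to be published is legal, publish the content to be published, and send a review-passed message to the client.
Specifically, the server is preset with a score threshold. The comprehensive score is compared with the preset score threshold, and when the comprehensive score is greater than the threshold, the content to be published is confirmed to be legal and is published, and a review-passed message is sent to the client.
It is worth noting that when the comprehensive score is less than or equal to the preset score threshold, the content to be published is deemed to be a possible violation. The server refuses to publish it, sends the client a notice that the review failed, and records the review request so that administrators can handle it later.
In this embodiment, upon receiving a review request for network content publishing sent by a client, the server obtains the current user information and the content to be published contained in the request, and compares the current user information with each piece of user information in the preset list type database to determine the corresponding user type. If the user type is ordinary user, the content to be published is parsed according to the preset sentence division method to obtain each basic sentence it contains. Semantic recognition is then performed on each basic sentence to obtain its semantic score, and the comprehensive score of the content is determined from these semantic scores. Finally, the comprehensive score is compared with the preset score threshold, and when it is greater than the threshold, the content is confirmed to be legal and is published, and a review-passed message is sent to the client. This enables intelligent semantic recognition of network content and reviews whether its publication is appropriate based on the recognized semantics, improving both the degree of intelligence and the accuracy of the review.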
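In one embodiment, referring to Fig. 3, after step S20 the method for reviewing network content publishing further includes:
S70: If the user type corresponding to the current user information is whitelist user, publish the content to be published.
Specifically, if the traversal query of the preset list type database shows that the user type corresponding to the current user information is whitelist user, the content to be published is published directly.
S80: If the user type corresponding to the current user information is blacklist user, remove the content to be published and send a review-failed message to the client.
Specifically, if the traversal query of the preset list type database shows that the user type corresponding to the current user information is blacklist user, there is no need to review the semantic information in the content to be published; the content is deleted directly and a review-failed message is sent to the client.
It should be noted that steps S70 and S80 have no required order of execution; they may be executed in parallel, and no limitation is imposed here.
In this embodiment, whitelist and blacklist users are handled by a quick review in a preset manner, without semantic recognition of their content, which improves the efficiency of reviewing network content publishing.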
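The routing described in steps S20, S60, S70, and S80 can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory dictionary `LIST_TYPE_DB` stands in for the preset list type database (and a dictionary lookup replaces the traversal query mentioned in the text), users absent from the dictionary are treated as ordinary users, and `score_fn` is a hypothetical callback standing in for steps S30 to S50.

```python
from typing import Callable

# Hypothetical stand-in for the preset list type database.
LIST_TYPE_DB = {"admin01": "whitelist", "spammer9": "blacklist"}


def lookup_user_type(user_id: str) -> str:
    # Assumption: anyone not on a list is an ordinary user.
    return LIST_TYPE_DB.get(user_id, "ordinary")


def review(user_id: str, content: str,
           score_fn: Callable[[str], float], threshold: float) -> str:
    """Return "published" or "rejected" for one review request."""
    user_type = lookup_user_type(user_id)
    if user_type == "whitelist":
        return "published"   # S70: publish without semantic review
    if user_type == "blacklist":
        return "rejected"    # S80: remove without semantic review
    # S60: ordinary users pass only if the score exceeds the threshold.
    return "published" if score_fn(content) > threshold else "rejected"
```

Because the comparison is strict, a score exactly equal to the threshold is rejected, which matches the condition stated for step S60.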
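On the basis of the embodiment corresponding to Fig. 2, a specific embodiment is used below to describe in detail how step S40 performs semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence.
Referring to Fig. 4, Fig. 4 shows the specific implementation flow of step S40 provided by an embodiment of this application, as follows:
S41: Perform word segmentation on the basic sentence by a preset word segmentation method to obtain the basic word segments contained in the basic sentence.
Specifically, each basic sentence obtained in step S30 is segmented by the preset word segmentation method to obtain the basic word segments contained in each basic sentence.
The preset word segmentation method includes, but is not limited to, a third-party word segmentation tool or a word segmentation algorithm.
Common third-party word segmentation tools include, but are not limited to, the Stanford NLP segmenter, the ICTCLAS word segmentation system, the ansj word segmentation tool, and the HanLP Chinese word segmentation tool.
Word segmentation algorithms include, but are not limited to, the forward Maximum Matching (MM) algorithm, the Reverse Maximum Matching (RMM) algorithm, the Bi-directional Maximum Matching (BM) algorithm, the Hidden Markov Model (HMM), and the N-gram model.
It is easy to see that extracting basic word segments by segmentation both filters out meaningless words in valid basic sentences and facilitates the later generation of word vectors from these basic word segments.
S42: Convert the basic word segments into word vectors and cluster the word vectors with a preset clustering algorithm to obtain the cluster center corresponding to each basic sentence.
In artificial intelligence, language representation mainly refers to a formal or mathematical description of language, so that language can be represented in a computer and processed automatically by computer programs. The word vector referred to in the embodiments of this application represents a basic word segment in the form of a vector.
Specifically, each basic word segment is first converted into its corresponding word vector. The word vectors are then clustered with the preset clustering algorithm to obtain the cluster center of the word vector of each basic word segment, and the cluster centers of the basic word segments in the same basic sentence are clustered further to obtain the cluster center corresponding to that basic sentence.
Clustering, also called cluster analysis, is a statistical method for classifying samples or indicators and an important algorithm in data mining. Clustering algorithms include, but are not limited to, K-Means clustering, mean-shift clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), expectation-maximization clustering based on Gaussian mixture models, agglomerative hierarchical clustering, and graph community detection.
Preferably, in this embodiment, the K-Means clustering algorithm is used: the word vectors of the basic word segments are clustered to determine the class of each basic word segment, and the basic sentence is then clustered to obtain its cluster center.
S43: For each basic sentence, calculate the distance between the cluster center corresponding to the basic sentence and each preset word meaning vector, take the preset word meaning vector with the minimum distance as the target vector, and use the semantic score corresponding to the target vector as the semantic score corresponding to the basic sentence.
Specifically, the server stores in advance preset word meaning vectors that represent specified semantics, each with a preset semantic score. For each basic sentence, the distance between its cluster center and each of these preset word meaning vectors is calculated, the preset word meaning vector with the minimum distance is taken as the target vector, and the semantic score corresponding to the target vector is used as the semantic score of the basic sentence.
Preferably, in this embodiment, after the target vector is determined, a scoring parameter may also be calculated from the distance between the basic sentence and the target vector, and the semantic score of the basic sentence may be determined from the scoring parameter and the semantic score corresponding to the target vector.
In this embodiment, the basic sentence is segmented by the preset word segmentation method to obtain its basic word segments, the basic word segments are converted into word vectors, and the word vectors are clustered with the preset clustering algorithm to obtain the cluster center of each basic sentence. For each basic sentence, the distance between its cluster center and each preset word meaning vector is calculated, the preset word meaning vector with the minimum distance is taken as the target vector, and the semantic score of the target vector is used as the semantic score of the basic sentence. This achieves semantic scoring of basic sentences and improves the degree of intelligence and the efficiency of the review.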
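The nearest-vector scoring of step S43 can be sketched in a few lines of Python. Several simplifications are assumptions made for illustration: the cluster center of a sentence is approximated by the arithmetic mean of its word vectors rather than by running K-Means, the distance is Euclidean, and the two-dimensional vectors and scores are made-up toy values.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Mean of the word vectors; a stand-in for the sentence's cluster center."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def semantic_score(word_vectors: list[list[float]],
                   preset: list[tuple[list[float], float]]) -> float:
    """Return the score of the preset word meaning vector nearest the center."""
    center = centroid(word_vectors)
    _target, score = min(preset, key=lambda item: math.dist(center, item[0]))
    return score


# Toy preset word meaning vectors, each paired with a preset semantic score.
PRESET = [([0.0, 1.0], 1.0), ([1.0, 0.0], -1.0)]
score = semantic_score([[0.0, 1.0], [0.2, 0.8]], PRESET)
```

Here the center is (0.1, 0.9), which lies closer to the first preset vector, so the sentence inherits that vector's score.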
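On the basis of the embodiment corresponding to Fig. 2, a specific embodiment is used below to describe in detail how step S41 performs word segmentation on the basic sentence by the preset word segmentation method to obtain the basic word segments contained in the basic sentence.
Referring to Fig. 5, Fig. 5 shows the specific implementation flow of step S41 provided by an embodiment of this application, as follows:
S411: Obtain a preset training corpus and analyze it with an N-gram model to obtain the word sequence data of the preset training corpus.
Specifically, the training corpus is a corpus obtained by training on relevant language material in order to evaluate basic sentences in natural language. The N-gram model is used to statistically analyze each corpus item in the preset training corpus, yielding the number of times a corpus item H appears after another corpus item I, and hence the word sequence data for the word sequence "item I + item H". The content of the training corpus in the embodiments of this application includes, but is not limited to, specialized information on the topics of a forum or online community, web language material, and general-purpose corpora.
A corpus is a large-scale electronic text collection that has been scientifically sampled and processed. Corpora are a basic resource for linguistic research and the main resource for empirical approaches to language study; they are used in lexicography, language teaching, traditional linguistic research, and statistics-based or example-based research in natural language processing. Corpus items, that is, language material, are the object of linguistic research and the basic units that make up a corpus.
For example, in a specific implementation, the preset training corpus is a corpus for the "current affairs" domain obtained by crawling trending online topics and current news with a web crawler.
A word sequence is a sequence formed by combining at least two corpus items in a certain order. The word sequence frequency is the ratio of the number of occurrences of the word sequence to the number of occurrences of all word segments in the whole corpus, where a word segment is a word sequence obtained by combining consecutive characters in a preset manner. For example, if the word sequence "爱吃西红柿" ("loves eating tomatoes") occurs 100 times in the whole corpus and the occurrences of all word segments in the corpus total 100000, the word sequence frequency of "爱吃西红柿" is 0.001.
The N-gram model is a language model commonly used in large-vocabulary continuous text semantic recognition. Using the collocation information between adjacent words in context, it can compute the sentence with the highest probability when continuous text without spaces needs to be converted into a string of Chinese characters (that is, a sentence), so that the conversion is automatic and requires no manual selection by the user, which improves the accuracy of word sequence determination.
It is worth noting that, to improve review efficiency, in this embodiment the process of obtaining the preset training corpus and analyzing it with the N-gram model to obtain its word sequence data can be carried out before the review, with the resulting word sequence data stored and simply retrieved when semantic recognition of content to be published is needed.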
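The counting in step S411 can be sketched as follows for the bigram case (N = 2). The sketch assumes the training corpus has already been segmented into token lists; the two toy sentences and the function name `build_word_sequence_data` are illustrative assumptions, not details from the application.

```python
from collections import Counter


def build_word_sequence_data(corpus: list[list[str]]) -> tuple[Counter, Counter]:
    """Count single tokens and adjacent token pairs across the corpus."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        # zip pairs each token with its successor: (w1, w2), (w2, w3), ...
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


toy_corpus = [["今天", "真", "热"], ["今天", "真", "冷"]]
unigrams, bigrams = build_word_sequence_data(toy_corpus)
```

Since these counts depend only on the corpus, they can be computed once before any review request arrives and stored, which is the efficiency point made above.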
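S412: Perform word segmentation analysis on the basic sentence to obtain M word segmentation sequences.
Specifically, a basic sentence can be broken up in different ways, and the sentence understood may differ accordingly. To ensure that the sentence is understood correctly, after obtaining a basic sentence the server obtains its M word segmentation sequences, where M is the total number of possible word segmentation sequences.
Each word segmentation sequence is one result of dividing the basic sentence, namely a text sequence containing at least two word segments.
For example, in a specific implementation, a basic sentence is "今天真热" ("It is really hot today"). Parsing this basic sentence gives word segmentation sequence A: "今天", "真", "热", word segmentation sequence B: "今", "天真", "热", and so on.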
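One way to enumerate the candidate word segmentation sequences of step S412 is a recursive dictionary-based split, sketched below. The vocabulary is a made-up toy dictionary, and exhaustive enumeration is an assumption for illustration; the source does not say how the M sequences are generated.

```python
def enumerate_segmentations(sentence: str, vocabulary: set[str]) -> list[list[str]]:
    """Return every way to cut the sentence into words from the vocabulary."""
    if not sentence:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(sentence) + 1):
        head = sentence[:end]
        if head in vocabulary:
            for tail in enumerate_segmentations(sentence[end:], vocabulary):
                results.append([head] + tail)
    return results


VOCAB = {"今", "天", "今天", "天真", "真", "热"}
candidates = enumerate_segmentations("今天真热", VOCAB)
```

With this toy vocabulary the example sentence has M = 3 candidates, including sequence A ("今天", "真", "热") and sequence B ("今", "天真", "热") from the text. The number of candidates can grow quickly with sentence length, so a practical system would likely prune or use dynamic programming.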
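S413: For each word segmentation sequence, calculate its occurrence probability from the word sequence data of the preset training corpus to obtain the occurrence probabilities of the M word segmentation sequences.
Specifically, the occurrence probability of each word segmentation sequence is calculated from the word sequence data obtained in step S411, giving the occurrence probabilities of the M word segmentation sequences.
The occurrence probability of a word segmentation sequence can be calculated using the Markov assumption: the occurrence of the Y-th word depends only on the preceding Y-1 words and on no other words, and the probability of the whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained directly from the corpus by counting how often the Y words occur together. That is:
$$P(T) = P(W_1 W_2 \dots W_Y) = P(W_1)\,P(W_2 \mid W_1) \cdots P(W_Y \mid W_1 W_2 \dots W_{Y-1}) \tag{1}$$
Here, P(T) is the probability of the whole sentence, and P(W_Y | W_1 W_2 ... W_{Y-1}) is the probability that the Y-th word segment appears after the word sequence formed by the preceding Y-1 word segments.
For example, one word segmentation sequence for the sentence "中华民族是一个有着悠久文明历史的民族" ("The Chinese nation is a nation with a long history of civilization") is: "中华民族", "是", "一个", "有着", "悠久", "文明", "历史", "的", "民族", which contains 9 word segments in total. When Y = 9, the term calculated is the probability that the word segment "民族" appears after the word sequence "中华民族是一个有着悠久文明历史的".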
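Formula (1) can be approximated in code by truncating each conditional probability to the immediately preceding word, which is the bigram case of the N-gram model. The sketch below makes that simplification explicit, uses unsmoothed maximum-likelihood estimates, and relies on made-up toy counts; none of these choices is specified by the source.

```python
from collections import Counter


def sequence_probability(tokens: list[str],
                         unigrams: Counter, bigrams: Counter) -> float:
    """Bigram estimate: P(w1) * product of P(w_k | w_{k-1})."""
    total = sum(unigrams.values())
    if not tokens or total == 0:
        return 0.0
    prob = unigrams[tokens[0]] / total
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


# Toy counts (assumed): six tokens observed in total.
UNI = Counter({"今天": 2, "真": 2, "热": 1, "冷": 1})
BI = Counter({("今天", "真"): 2, ("真", "热"): 1, ("真", "冷"): 1})
p_a = sequence_probability(["今天", "真", "热"], UNI, BI)
p_b = sequence_probability(["今", "天真", "热"], UNI, BI)
```

For sequence A the estimate is (2/6) × (2/2) × (1/2) = 1/6, while sequence B contains words never seen in the toy counts and gets probability 0. A production system would normally add smoothing so that unseen words do not zero out the whole product.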
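S414: From the occurrence probabilities of the M word segmentation sequences, select the word segmentation sequences whose occurrence probability reaches the preset probability threshold as the target word segmentation sequences, and use each word segment in a target word segmentation sequence as a basic word segment contained in the basic sentence.
Specifically, the calculation in step S413 yields one occurrence probability for each word segmentation sequence, M in total. Each of the M occurrence probabilities is compared with the preset probability threshold; those greater than or equal to the threshold are taken as effective occurrence probabilities, and the word segmentation sequences corresponding to them are used as the target word segmentation sequences.
Comparison with the preset probability threshold filters out word segmentation sequences whose occurrence probability does not meet the requirement, so that the selected target word segmentation sequences are closer to the meaning expressed in natural language, which improves the accuracy of semantic recognition.
It should be noted that if the calculated occurrence probabilities of all M word segmentation sequences are below the preset probability threshold, the content to be published is determined to be content that does not conform to the specification. In that case, review failure is taken as the review result, and the reminder message "Please abide by the rules of online speech and be a civilized netizen" is sent to the client. If the number of target word segmentation sequences exceeds a preset number, they are sorted by occurrence probability and the top preset number of sequences are selected as the target word segmentation sequences. For example, if the preset number is 5, the effective occurrence probabilities are sorted, the top 5 are selected, and the word segmentation sequences corresponding to these 5 occurrence probabilities are used as the target word segmentation sequences.
Preferably, in this embodiment, the word segmentation sequence with the maximum occurrence probability is selected as the target word segmentation sequence, so as to reduce subsequent computation and improve the review efficiency of network content publishing.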
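The selection rule of step S414 (keep sequences at or above the threshold, then cap the count) can be sketched as below. The function name `select_targets` and the sample probabilities are assumptions for illustration.

```python
def select_targets(scored: list[tuple[list[str], float]],
                   threshold: float, max_count: int) -> list[list[str]]:
    """Keep sequences with probability >= threshold, best first, at most max_count."""
    effective = [(seq, p) for seq, p in scored if p >= threshold]
    effective.sort(key=lambda item: item[1], reverse=True)
    return [seq for seq, _p in effective[:max_count]]


scored = [(["今天", "真", "热"], 0.5),
          (["今", "天真", "热"], 0.1),
          (["今", "天", "真", "热"], 0.7)]
top_one = select_targets(scored, threshold=0.2, max_count=1)
```

An empty result corresponds to the case where every sequence falls below the threshold, in which the text treats the content as non-conforming. Setting `max_count` to 1 gives the preferred variant that keeps only the most probable sequence.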
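In this embodiment, the preset training corpus is obtained and analyzed with the N-gram model to obtain its word sequence data, so that the word sequence data can be used directly when occurrence probabilities are calculated later, which saves computation time and helps improve review efficiency. In addition, the basic sentence is analyzed by word segmentation to obtain M word segmentation sequences; for each word segmentation sequence, its occurrence probability is calculated from the word sequence data of the preset training corpus, giving the occurrence probabilities of the M word segmentation sequences; and from these, the word segmentation sequences whose occurrence probability reaches the preset probability threshold are selected as the target word segmentation sequences, with each word segment in a target word segmentation sequence used as a basic word segment contained in the basic sentence. This ensures the accuracy of word segmentation and helps improve the accuracy of the subsequent clustering and semantic evaluation based on the basic word segments.
In one embodiment, the specific implementation flow of step S50, determining the comprehensive score of the content to be published according to the semantic score of each basic sentence, is detailed as follows:
The comprehensive score of the content to be published is calculated by the following formulas: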
Figure PCTCN2020085582-appb-000001
Figure PCTCN2020085582-appb-000002
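Here, M_i is the semantic score of the i-th basic sentence, a and b are preset parameters, S_i is the weighted score of the i-th basic sentence, n is the number of basic sentences, W is the comprehensive score of the content to be published, i and n are positive integers, and i ≤ n.
It is worth noting that in this embodiment the semantic score can be used to express how well the semantics conform to the specification. A semantic score less than 0 indicates that the semantics of the basic sentence do not conform. The preset parameter a is set to a larger value than the preset parameter b, so that non-conforming basic sentences have a greater impact on the content to be published as a whole. The values of a and b can be chosen according to the actual situation and are not specifically limited here.
In this embodiment, semantic scores in different ranges are weighted and aggregated by the preset formula to obtain the comprehensive score of the content to be published, which helps make the comprehensive score a more reasonable evaluation.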
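The two formulas themselves are available only as figure images in this text, so they cannot be reproduced here. The Python sketch below is therefore an assumed reading, not the published formula: it supposes that S_i = a·M_i when M_i < 0 and S_i = b·M_i otherwise, and that W is the mean of the S_i. That reading is merely consistent with the statements that a weights non-conforming sentences, that a is larger than b, and that n is the number of basic sentences.

```python
def composite_score(semantic_scores: list[float], a: float, b: float) -> float:
    """Assumed form: W = (1/n) * sum(S_i), with S_i = a*M_i if M_i < 0 else b*M_i."""
    if not semantic_scores:
        raise ValueError("at least one basic sentence is required")
    weighted = [a * m if m < 0 else b * m for m in semantic_scores]
    return sum(weighted) / len(weighted)


# With a = 2 and b = 1, one non-conforming sentence outweighs one conforming one.
w = composite_score([1.0, -1.0], a=2.0, b=1.0)
```

Under these assumptions the example gives W = (1 − 2) / 2 = −0.5, showing how a larger a pulls the comprehensive score down when any basic sentence scores below zero.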
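It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of this application in any way.
Fig. 6 shows a schematic block diagram of an apparatus for reviewing network content publishing that corresponds one-to-one to the method for reviewing network content publishing in the above embodiments. As shown in Fig. 6, the apparatus includes a request receiving module 10, a type matching module 20, a content parsing module 30, a semantic recognition module 40, a comprehensive scoring module 50, and a result determination module 60. The functional modules are described in detail as follows:
the request receiving module 10 is configured to, if a review request for network content publishing sent by a client is received, obtain the current user information and the content to be published contained in the review request;
the type matching module 20 is configured to match the current user information against each piece of user information in the preset list type database to determine the user type corresponding to the current user information, where the list type database includes each piece of user information and the user type corresponding to that user information;
the content parsing module 30 is configured to, if the user type corresponding to the current user information is ordinary user, parse the content to be published according to the preset sentence division method to obtain each basic sentence contained in the content to be published;
the semantic recognition module 40 is configured to perform semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence;
the comprehensive scoring module 50 is configured to determine the comprehensive score of the content to be published according to the semantic score of each basic sentence;
the result determination module 60 is configured to compare the comprehensive score with the preset score threshold, and if the comprehensive score is greater than the preset score threshold, confirm that the content to be published is legal, publish the content to be published, and send a review-passed message to the client.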
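Further, the apparatus for reviewing network content publishing also includes:
a first review module 70, configured to publish the content to be published if the user type corresponding to the current user information is whitelist user;
a second review module 80, configured to remove the content to be published and send a review-failed message to the client if the user type corresponding to the current user information is blacklist user.
Further, the semantic recognition module 40 includes:
a word segmentation unit 41, configured to perform word segmentation on the basic sentence by the preset word segmentation method to obtain the basic word segments contained in the basic sentence;
a clustering unit 42, configured to convert the basic word segments into word vectors and cluster the word vectors with the preset clustering algorithm to obtain the cluster center corresponding to each basic sentence;
a scoring unit 43, configured to, for each basic sentence, calculate the distance between the cluster center corresponding to the basic sentence and each preset word meaning vector, take the preset word meaning vector with the minimum distance as the target vector, and use the semantic score corresponding to the target vector as the semantic score corresponding to the basic sentence.
Further, the word segmentation unit 41 includes:
a training subunit 411, configured to obtain the preset training corpus and analyze it with the N-gram model to obtain the word sequence data of the preset training corpus;
a parsing subunit 412, configured to perform word segmentation analysis on the basic sentence to obtain M word segmentation sequences;
a calculation subunit 413, configured to, for each word segmentation sequence, calculate its occurrence probability from the word sequence data of the preset training corpus to obtain the occurrence probabilities of the M word segmentation sequences;
a selection subunit 414, configured to select, from the occurrence probabilities of the M word segmentation sequences, the word segmentation sequences whose occurrence probability reaches the preset probability threshold as the target word segmentation sequences, and use each word segment in a target word segmentation sequence as a basic word segment contained in the basic sentence.
Further, the comprehensive scoring module 50 includes:
a score calculation unit 51, configured to calculate the comprehensive score of the content to be published by the following formulas: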
Figure PCTCN2020085582-appb-000003
Figure PCTCN2020085582-appb-000004
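Here, M_i is the semantic score of the i-th basic sentence, a and b are preset parameters, S_i is the weighted score of the i-th basic sentence, n is the number of basic sentences, W is the comprehensive score of the content to be published, i and n are positive integers, and i ≤ n.
For the specific limitations of the apparatus for reviewing network content publishing, refer to the limitations of the method for reviewing network content publishing above, which are not repeated here. Each module of the apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
Fig. 7 is a schematic diagram of a computer device provided by an embodiment of this application. The computer device may be a server, and its internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database stores the preset corpus and the preset word meaning vectors. The network interface communicates with external terminals over a network connection. When executed by the processor, the computer program implements any one or more sets of steps of the method for reviewing network content publishing disclosed above.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the method for reviewing network content publishing in the above embodiments, for example steps S10 to S60 shown in Fig. 2. Alternatively, when executing the computer program, the processor implements the functions of the modules/units of the apparatus for reviewing network content publishing in the above embodiments, for example the functions of modules 10 to 60 shown in Fig. 6. To avoid repetition, details are not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional units and modules is only an example. In practice, the functions may be assigned to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
In one embodiment, a computer-readable storage medium is provided. The computer-readable storage medium is a volatile or non-volatile storage medium and stores a computer program. When executed by a processor, the computer program implements the steps of the method for reviewing network content publishing in the above embodiments, or implements the functions of each module/unit of the apparatus for reviewing network content publishing in the above embodiments. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware, and that the computer program, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).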

Claims (18)

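  1. A method for reviewing network content publishing, wherein the method includes:
    if a review request for network content publishing sent by a client is received, obtaining the current user information and the content to be published contained in the review request;
    matching the current user information against each piece of user information in a preset list type database to determine the user type corresponding to the current user information, wherein the list type database includes each piece of user information and the user type corresponding to that user information;
    if the user type corresponding to the current user information is ordinary user, parsing the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published;
    performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence;
    determining the comprehensive score of the content to be published according to the semantic score of each basic sentence;
    comparing the comprehensive score with a preset score threshold, and if the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing the content to be published, and sending a review-passed message to the client.
  2. The method for reviewing network content publishing according to claim 1, wherein after matching the current user information against each piece of user information in the preset list type database to determine the user type corresponding to the current user information, the method further includes:
    if the user type corresponding to the current user information is whitelist user, publishing the content to be published;
    if the user type corresponding to the current user information is blacklist user, removing the content to be published and sending a review-failed message to the client.
  3. The method for reviewing network content publishing according to claim 1, wherein performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence includes:
    performing word segmentation on the basic sentence by a preset word segmentation method to obtain the basic word segments contained in the basic sentence;
    converting the basic word segments into word vectors and clustering the word vectors with a preset clustering algorithm to obtain the cluster center corresponding to each basic sentence;
    for each basic sentence, calculating the distance between the cluster center corresponding to the basic sentence and each preset word meaning vector, taking the preset word meaning vector with the minimum distance as the target vector, and using the semantic score corresponding to the target vector as the semantic score corresponding to the basic sentence.
  4. The method for reviewing network content publishing according to claim 3, wherein before performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence, the method further includes:
    obtaining a preset training corpus and analyzing the preset training corpus with an N-gram model to obtain the word sequence data of the preset training corpus;
    and wherein performing word segmentation on the basic sentence by the preset word segmentation method to obtain the basic word segments contained in the basic sentence includes:
    performing word segmentation analysis on the basic sentence to obtain M word segmentation sequences;
    for each word segmentation sequence, calculating its occurrence probability from the word sequence data of the preset training corpus to obtain the occurrence probabilities of the M word segmentation sequences;
    selecting, from the occurrence probabilities of the M word segmentation sequences, the word segmentation sequences whose occurrence probability reaches a preset probability threshold as target word segmentation sequences, and using each word segment in a target word segmentation sequence as a basic word segment contained in the basic sentence.
  5. The method for reviewing network content publishing according to any one of claims 1 to 4, wherein determining the comprehensive score of the content to be published according to the semantic score of each basic sentence includes:
    calculating the comprehensive score of the content to be published by the following formulas: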
    Figure PCTCN2020085582-appb-100001
    Figure PCTCN2020085582-appb-100002
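    wherein M_i is the semantic score of the i-th basic sentence, a and b are preset parameters, S_i is the weighted score of the i-th basic sentence, n is the number of basic sentences, W is the comprehensive score of the content to be published, i and n are positive integers, and i ≤ n.
  6. An apparatus for reviewing network content publishing, wherein the apparatus includes:
    a request receiving module, configured to, if a review request for network content publishing sent by a client is received, obtain the current user information and the content to be published contained in the review request;
    a type matching module, configured to match the current user information against each piece of user information in a preset list type database to determine the user type corresponding to the current user information, wherein the list type database includes each piece of user information and the user type corresponding to that user information;
    a content parsing module, configured to, if the user type corresponding to the current user information is ordinary user, parse the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published;
    a semantic recognition module, configured to perform semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence;
    a comprehensive scoring module, configured to determine the comprehensive score of the content to be published according to the semantic score of each basic sentence;
    a result determination module, configured to compare the comprehensive score with a preset score threshold, and if the comprehensive score is greater than the preset score threshold, confirm that the content to be published is legal, publish the content to be published, and send a review-passed message to the client.
  7. The apparatus for reviewing network content publishing according to claim 6, wherein the apparatus further includes:
    a first review module, configured to publish the content to be published if the user type corresponding to the current user information is whitelist user;
    a second review module, configured to remove the content to be published and send a review-failed message to the client if the user type corresponding to the current user information is blacklist user.
  8. The apparatus for reviewing network content publishing according to claim 6, wherein the semantic recognition module includes:
    a word segmentation unit, configured to perform word segmentation on the basic sentence by a preset word segmentation method to obtain the basic word segments contained in the basic sentence;
    a clustering unit, configured to convert the basic word segments into word vectors and cluster the word vectors with a preset clustering algorithm to obtain the cluster center corresponding to each basic sentence;
    a scoring unit, configured to, for each basic sentence, calculate the distance between the cluster center corresponding to the basic sentence and each preset word meaning vector, take the preset word meaning vector with the minimum distance as the target vector, and use the semantic score corresponding to the target vector as the semantic score corresponding to the basic sentence.
  9. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a method for reviewing network content publishing, including:
    if a review request for network content publishing sent by a client is received, obtaining the current user information and the content to be published contained in the review request;
    matching the current user information against each piece of user information in a preset list type database to determine the user type corresponding to the current user information, wherein the list type database includes each piece of user information and the user type corresponding to that user information;
    if the user type corresponding to the current user information is ordinary user, parsing the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published;
    performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence;
    determining the comprehensive score of the content to be published according to the semantic score of each basic sentence;
    comparing the comprehensive score with a preset score threshold, and if the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing the content to be published, and sending a review-passed message to the client.
  10. The computer device according to claim 9, wherein after matching the current user information against each piece of user information in the preset list type database to determine the user type corresponding to the current user information, the method further includes:
    if the user type corresponding to the current user information is whitelist user, publishing the content to be published;
    if the user type corresponding to the current user information is blacklist user, removing the content to be published and sending a review-failed message to the client.
  11. The computer device according to claim 9, wherein performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence includes:
    performing word segmentation on the basic sentence by a preset word segmentation method to obtain the basic word segments contained in the basic sentence;
    converting the basic word segments into word vectors and clustering the word vectors with a preset clustering algorithm to obtain the cluster center corresponding to each basic sentence;
    for each basic sentence, calculating the distance between the cluster center corresponding to the basic sentence and each preset word meaning vector, taking the preset word meaning vector with the minimum distance as the target vector, and using the semantic score corresponding to the target vector as the semantic score corresponding to the basic sentence.
  12. The computer device according to claim 11, wherein before performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence, the method further includes:
    obtaining a preset training corpus and analyzing the preset training corpus with an N-gram model to obtain the word sequence data of the preset training corpus;
    and wherein performing word segmentation on the basic sentence by the preset word segmentation method to obtain the basic word segments contained in the basic sentence includes:
    performing word segmentation analysis on the basic sentence to obtain M word segmentation sequences;
    for each word segmentation sequence, calculating its occurrence probability from the word sequence data of the preset training corpus to obtain the occurrence probabilities of the M word segmentation sequences;
    selecting, from the occurrence probabilities of the M word segmentation sequences, the word segmentation sequences whose occurrence probability reaches a preset probability threshold as target word segmentation sequences, and using each word segment in a target word segmentation sequence as a basic word segment contained in the basic sentence.
  13. The computer device according to any one of claims 9 to 12, wherein determining the comprehensive score of the content to be published according to the semantic score of each basic sentence includes:
    calculating the comprehensive score of the content to be published by the following formulas: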
    Figure PCTCN2020085582-appb-100003
    Figure PCTCN2020085582-appb-100004
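    wherein M_i is the semantic score of the i-th basic sentence, a and b are preset parameters, S_i is the weighted score of the i-th basic sentence, n is the number of basic sentences, W is the comprehensive score of the content to be published, i and n are positive integers, and i ≤ n.
  14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a method for reviewing network content publishing, including:
    if a review request for network content publishing sent by a client is received, obtaining the current user information and the content to be published contained in the review request;
    matching the current user information against each piece of user information in a preset list type database to determine the user type corresponding to the current user information, wherein the list type database includes each piece of user information and the user type corresponding to that user information;
    if the user type corresponding to the current user information is ordinary user, parsing the content to be published according to a preset sentence division method to obtain each basic sentence contained in the content to be published;
    performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence;
    determining the comprehensive score of the content to be published according to the semantic score of each basic sentence;
    comparing the comprehensive score with a preset score threshold, and if the comprehensive score is greater than the preset score threshold, confirming that the content to be published is legal, publishing the content to be published, and sending a review-passed message to the client.
  15. The storage medium according to claim 14, wherein after matching the current user information against each piece of user information in the preset list type database to determine the user type corresponding to the current user information, the method further includes:
    if the user type corresponding to the current user information is whitelist user, publishing the content to be published;
    if the user type corresponding to the current user information is blacklist user, removing the content to be published and sending a review-failed message to the client.
  16. The storage medium according to claim 14, wherein performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence includes:
    performing word segmentation on the basic sentence by a preset word segmentation method to obtain the basic word segments contained in the basic sentence;
    converting the basic word segments into word vectors and clustering the word vectors with a preset clustering algorithm to obtain the cluster center corresponding to each basic sentence;
    for each basic sentence, calculating the distance between the cluster center corresponding to the basic sentence and each preset word meaning vector, taking the preset word meaning vector with the minimum distance as the target vector, and using the semantic score corresponding to the target vector as the semantic score corresponding to the basic sentence.
  17. The storage medium according to claim 16, wherein before performing semantic recognition on each basic sentence by means of natural language semantic recognition to obtain the semantic score corresponding to each basic sentence, the method further includes:
    obtaining a preset training corpus and analyzing the preset training corpus with an N-gram model to obtain the word sequence data of the preset training corpus;
    and wherein performing word segmentation on the basic sentence by the preset word segmentation method to obtain the basic word segments contained in the basic sentence includes:
    performing word segmentation analysis on the basic sentence to obtain M word segmentation sequences;
    for each word segmentation sequence, calculating its occurrence probability from the word sequence data of the preset training corpus to obtain the occurrence probabilities of the M word segmentation sequences;
    selecting, from the occurrence probabilities of the M word segmentation sequences, the word segmentation sequences whose occurrence probability reaches a preset probability threshold as target word segmentation sequences, and using each word segment in a target word segmentation sequence as a basic word segment contained in the basic sentence.
  18. The storage medium according to any one of claims 14 to 17, wherein determining the comprehensive score of the content to be published according to the semantic score of each basic sentence includes:
    calculating the comprehensive score of the content to be published by the following formulas: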
    Figure PCTCN2020085582-appb-100005
    Figure PCTCN2020085582-appb-100006
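    wherein M_i is the semantic score of the i-th basic sentence, a and b are preset parameters, S_i is the weighted score of the i-th basic sentence, n is the number of basic sentences, W is the comprehensive score of the content to be published, i and n are positive integers, and i ≤ n.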
PCT/CN2020/085582 2019-06-17 2020-04-20 网络内容发布的审核方法、装置、计算机设备及存储介质 WO2020253350A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910522440.6 2019-06-17
CN201910522440.6A CN110377900A (zh) 2019-06-17 2019-06-17 网络内容发布的审核方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2020253350A1 true WO2020253350A1 (zh) 2020-12-24

Family

ID=68248961

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/085582 WO2020253350A1 (zh) 2019-06-17 2020-04-20 网络内容发布的审核方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN110377900A (zh)
WO (1) WO2020253350A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783917A (zh) * 2021-01-04 2021-05-11 广州海量数据库技术有限公司 工单审核方法及装置、存储介质及电子设备
CN113835730A (zh) * 2021-09-24 2021-12-24 支付宝(杭州)信息技术有限公司 一种更新审核程序的方法、装置、设备及介质

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
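CN110377900A (zh) * 2019-06-17 2019-10-25 深圳壹账通智能科技有限公司 Method and apparatus for reviewing network content publication, computer device, and storage medium
CN111125023A (zh) * 2019-11-15 2020-05-08 北京十分科技有限公司 File review, review control, and publishing methods and corresponding apparatuses
CN110929055B (zh) * 2019-11-15 2023-05-02 北京达佳互联信息技术有限公司 Multimedia quality detection method and apparatus, electronic device, and storage medium
CN111209363B (zh) * 2019-12-25 2024-02-09 华为技术有限公司 Corpus data processing method and apparatus, server, and storage medium
CN111309938A (zh) * 2020-01-22 2020-06-19 恒大新能源汽车科技(广东)有限公司 Multimedia file processing method and apparatus
CN111414515A (zh) * 2020-03-17 2020-07-14 中国建设银行股份有限公司 Resource review method, apparatus, device, and storage medium
CN113761182A (zh) * 2020-06-17 2021-12-07 北京沃东天骏信息技术有限公司 Method and apparatus for determining business problems
CN112163585B (zh) * 2020-11-10 2023-11-10 上海七猫文化传媒有限公司 Text review method and apparatus, computer device, and storage medium
CN112464036B (zh) * 2020-11-24 2023-06-16 行吟信息科技(武汉)有限公司 Method and apparatus for reviewing violating data
CN112906387B (zh) * 2020-12-25 2023-08-04 北京百度网讯科技有限公司 Risk content identification method, apparatus, device, medium, and computer program product
CN113010708B (zh) * 2021-03-11 2023-08-25 上海麦糖信息科技有限公司 Review method and system for violating Moments content and violating chat content
CN114245160A (zh) * 2021-12-07 2022-03-25 北京达佳互联信息技术有限公司 Information processing method and apparatus, electronic device, and storage medium
CN116822494B (zh) * 2023-08-28 2023-12-08 深圳有咖互动科技有限公司 Radio drama information processing method and apparatus, electronic device, and computer-readable medium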

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
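CN101446970A (zh) * 2008-12-15 2009-06-03 腾讯科技(深圳)有限公司 Method and apparatus for reviewing and processing text content published by users
CN102096680A (zh) * 2009-12-15 2011-06-15 北京大学 Method and apparatus for analyzing information validity
CN102098332A (zh) * 2010-12-30 2011-06-15 北京新媒传信科技有限公司 Content review method and apparatus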
US8224950B2 (en) * 1997-03-25 2012-07-17 Symantec Corporation System and method for filtering data received by a computer system
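CN109635073A (zh) * 2018-10-18 2019-04-16 深圳壹账通智能科技有限公司 Forum community application management method, apparatus, device, and computer-readable storage medium
CN110377900A (zh) * 2019-06-17 2019-10-25 深圳壹账通智能科技有限公司 Method and apparatus for reviewing network content publication, computer device, and storage medium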

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6334697B2 (ja) * 2013-11-08 2018-05-30 グーグル エルエルシー System and method for extracting and generating images of display content
CN109800307B (zh) * 2019-01-18 2022-08-02 深圳壹账通智能科技有限公司 Product review analysis method and apparatus, computer device, and storage medium

Also Published As

Publication number Publication date
CN110377900A (zh) 2019-10-25

Similar Documents

Publication Publication Date Title
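WO2020253350A1 (zh) Method and apparatus for reviewing network content publication, computer device, and storage medium
CN110765244B (zh) Method and apparatus for obtaining response scripts, computer device, and storage medium
CN108376151B (zh) Question classification method and apparatus, computer device, and storage medium
CN108304375B (zh) Information recognition method and device, storage medium, and terminal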
US7783476B2 (en) Word extraction method and system for use in word-breaking using statistical information
US20190347571A1 (en) Classifier training
US9785684B2 (en) Determining temporal categories for a domain of content for natural language processing
US20070192309A1 (en) Method and system for identifying sentence boundaries
US9483582B2 (en) Identification and verification of factual assertions in natural language
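CN110928994A (zh) Similar case retrieval method, similar case retrieval apparatus, and electronic device
WO2021114841A1 (zh) User report generation method and terminal device
CN112328742A (zh) Artificial-intelligence-based training method and apparatus, computer device, and storage medium
WO2020077825A1 (zh) Forum community application management method, apparatus, device, and readable storage medium
CN111767393A (zh) Method and apparatus for extracting core text content
CN111985228A (zh) Text keyword extraction method and apparatus, computer device, and storage medium
CN114896305A (zh) Smart Internet security platform based on big data technology
CN111552798B (zh) Name information processing method and apparatus based on a name prediction model, and electronic device
CN113343108A (zh) Recommendation information processing method, apparatus, device, and storage medium
TWI734085B (zh) Dialogue system using intent-detection ensemble learning and method thereof
WO2022134834A1 (zh) Potential event prediction method, apparatus, device, and storage medium
CN110019763B (zh) Text filtering method, system, device, and computer-readable storage medium
CN111930949B (zh) Search string processing method and apparatus, computer-readable medium, and electronic device
WO2023035529A1 (zh) Intent-recognition-based intelligent information query method, apparatus, device, and medium
CN113177164B (zh) Big-data-based multi-platform collaborative new media content monitoring and management system
CN112507115B (zh) Method and apparatus for classifying sentiment words in bullet-comment text, and storage medium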

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20827002

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20827002

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 29/03/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20827002

Country of ref document: EP

Kind code of ref document: A1