WO2021174926A1 - Website bad information monitoring system and monitoring method thereof - Google Patents

Website bad information monitoring system and monitoring method thereof

Info

Publication number
WO2021174926A1
Authority
WO
WIPO (PCT)
Prior art keywords
monitoring
audio
text
information
bad
Prior art date
Application number
PCT/CN2020/132692
Other languages
English (en)
French (fr)
Inventor
虞焰兴
Original Assignee
安徽声讯信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 安徽声讯信息技术有限公司
Publication of WO2021174926A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to the technical field of network information security, in particular to a website bad information monitoring system and a monitoring method thereof.
  • the virtual online world and the real world are equal.
  • the real world has beauty and ugliness, good and evil
  • the online world also has beauty and ugliness, good and evil.
  • bad information was still very limited.
  • bad information began to gradually spread.
  • bad information has even developed into an industry, and it has begun to transform from purely “knowledge-based” information to “profit-making”, with various methods and complex forms, including many bad information that violates laws and morals.
  • the invention patent application CN107547555A, published by the State Intellectual Property Office on January 5, 2018, discloses a website security monitoring method that classifies and examines the extracted web content information as text, pictures, and videos, and sets multiple thresholds to prevent misjudgment.
  • There are two major problems with this monitoring scheme: first, it ignores the role of audio data in identifying bad information; second, preventing misjudgment with thresholds still leaves room for misjudgment. A website that contains no bad information may be judged a bad information website, and a bad information website may be judged a regular website. In short, the monitoring accuracy is insufficient.
  • the present invention provides a monitoring system and a monitoring method for website bad information.
  • the present invention protects a website bad information monitoring system, which includes a monitoring device, a voice recognition server that interacts with the monitoring device, and a manual monitoring and checking terminal.
  • the monitoring device obtains the webpage content information of each webpage of the target website; the webpage content information includes at least one of text, picture, audio, and video. For the video information contained in the webpages of the website, the monitoring device extracts the audio data as video-audio information.
  • the monitoring device cuts the audio streams of the video-audio information and audio information contained in the webpages of the website at natural sentence boundaries, and sends the cut audio segments to the speech recognition server in order.
  • the speech recognition server converts the content of each audio segment into text and returns it to the monitoring device.
  • the monitoring device searches the text for bad keywords, and sends the text matching the bad keywords, together with its corresponding audio segments and log files, to the manual monitoring and verification terminal.
  • the manual monitoring and verification terminal maps the audio segments to the texts one-to-one according to the log file and displays them for manual verification;
  • the log files include, but are not limited to, the source webpage link, the start time of the audio segment, the end time of the audio segment, the audio code corresponding to the audio segment, and the text corresponding to the audio segment.
  • the monitoring device examines the text, pictures, and videos contained in the webpages of the website, and if it identifies bad information, it sends the link of the webpage containing the bad information to the manual monitoring and verification terminal for manual verification.
  • the monitoring device interacts with a natural language processing server: the speech recognition server converts the audio segment content into a primary text and returns it to the monitoring device, the monitoring device sends the primary text to the natural language processing server, and the natural language processing server automatically corrects the primary text according to natural language and returns the corrected secondary text to the monitoring device.
  • the monitoring device performs a bad keyword search on the secondary text, and sends the secondary text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal, which maps the audio segments to the secondary texts one-to-one according to the log file and displays them for manual verification.
  • the present invention also protects a monitoring method of the above-mentioned website bad information monitoring system, which includes the steps:
  • the monitoring equipment uses a web crawler to obtain the web content information of each web page of the target website.
  • the web content information includes at least one of text, picture, audio, and video.
  • the monitoring equipment extracts the audio data from it as the video-audio information.
  • the monitoring device cuts the audio streams of the video-audio information and audio information contained in the webpages of the website at natural sentence boundaries, and sends the cut audio segments to the speech recognition server in sequence.
  • the voice recognition server converts the content of the audio segment into text and returns it to the monitoring device, and the monitoring device sends the text returned by the voice recognition server to the natural language processing server.
  • the natural language processing server automatically corrects the primary text according to the natural language, and returns the corrected secondary text to the monitoring device.
  • the monitoring device performs a bad keyword search on the secondary text, and sends the secondary text matching the bad keyword and its corresponding audio segment and log file to the manual monitoring and verification terminal.
  • the manual monitoring and verification terminal will display the audio segment and the secondary text in a one-to-one correspondence according to the log file for manual verification;
  • the log file includes, but is not limited to, the source webpage link, the start time of the audio segment, the end time of the audio segment, the audio code corresponding to the audio segment, and the text corresponding to the audio segment.
  • the monitoring device examines the text, pictures, and videos contained in the webpages of the website, and if it identifies bad information, it sends the link of the webpage containing the bad information to the manual monitoring and verification terminal for manual verification.
  • after the monitoring device performs a bad keyword search on the secondary text, it analyzes the semantics of each sentence containing a bad keyword through semantic understanding; only after the sentence is judged to be bad information are the secondary text, its corresponding audio segment, and the log file sent to the manual monitoring and verification terminal.
  • the monitoring device numbers each segment of audio and text; if there is no corresponding text in the audio segment, the monitoring device marks it in the log file.
  • the duration of the audio segment is limited to less than 60s.
  • the bad keywords in the text are highlighted.
  • the monitoring device cyclically processes all websites in a certain area concurrently.
  • audio information and the audio data in video information are also included in the scope of website bad information monitoring.
  • the existing means of monitoring audio data are mainly manual monitoring and speech recognition: manual monitoring has the problem of heavy workload, and speech recognition has the problem of low accuracy.
  • the present invention organically combines these two audio data monitoring methods and overcomes the difficulties that arise in combining them. While ensuring monitoring accuracy, it greatly reduces the workload of manual monitoring and therefore has good value for wider adoption. The same combination of intelligent identification by the system and manual verification can also be used to verify text and pictures (including screenshots of video frames), improving the accuracy of identifying bad information on websites.
  • Figure 1 is a block diagram of the website's bad information monitoring system.
  • a monitoring system for website bad information includes monitoring equipment, a voice recognition server that interacts with the monitoring equipment, and a manual monitoring and verification terminal.
  • the specific monitoring method includes the following steps:
  • the monitoring equipment uses a web crawler to obtain the web content information of each web page of the target website.
  • the web content information includes at least one of text, picture, audio, and video.
  • the monitoring equipment extracts the audio data from it as the video-audio information.
  • Audio information and audio data in video information also contain a lot of information, which should not be ignored in the monitoring of bad information. Ignoring this information will cause huge loopholes in the monitoring system, and the possibility of being exploited by lawbreakers, resulting in a large number of missed monitoring of bad information.
  • the monitoring device cuts the audio streams of the video-audio information and audio information contained in the webpages of the website at natural sentence boundaries, and sends the cut audio segments to the speech recognition server in sequence.
  • People pause when they speak normally, so the audio stream is cut according to natural sentences.
  • This has two benefits: first, it preserves the integrity of the audio information and prevents audio data loss; second, it reduces the bandwidth occupied during transmission, so that audio reaches the speech recognition server quickly instead of getting stuck on the way because of network congestion. On a congested road, bicycles, electric scooters, and especially pedestrians can weave through the gaps between cars; network transmission works the same way.
  • If no sufficiently long pause is detected within 60 s, the audio stream is forcibly cut to avoid excessively long audio segments, which would slow down both the transmission of the segment and the response of the speech recognition server; this keeps the system timely.
  • In addition, once the audio stream is cut into an audio segment, the segment is independent of the stream still being generated. The segment has ended, so it can be played back, which facilitates manual monitoring and verification.
  • the voice recognition server is an existing third-party server.
  • the voice recognition server converts the audio segment content into text and returns it to the monitoring device.
  • the monitoring equipment searches the text for bad keywords, and sends the text matching the bad keywords, together with its corresponding audio segments and log files, to the manual monitoring and verification terminal; the log files include, but are not limited to, the source webpage link, the start time of the audio segment, the end time of the audio segment, the audio code corresponding to the audio segment, and the text corresponding to the audio segment.
  • the start time and end time of the audio segment are subject to Beijing time.
  • the start time, end time, and corresponding audio code of the audio segment are information that the monitoring device can obtain during the audio cutting process, and the text corresponding to the audio segment is the text returned by the speech recognition server.
  • As for the bad keyword database, it comes from years of accumulated experience in cracking down on illegal network information. A bad keyword deep learning model can also be built from the existing bad keyword database and deep learning technology to improve the speed and accuracy of detecting bad information in text.
  • the identification of bad information in pictures can also be processed by image recognition models based on deep learning technology.
  • the manual monitoring and verification terminal will correspond and display the audio segment and the text one-to-one according to the log file for manual verification.
  • the monitoring equipment examines the text, pictures, and videos contained in the webpages of the website, and if it identifies bad information, it sends the link of the webpage containing the bad information to the manual monitoring and verification terminal for manual verification.
  • the manual monitoring and verification terminal can connect to the monitoring equipment over a network, which removes local-area restrictions: a monitoring and verification office set up at one fixed location can check and monitor the monitoring equipment across the country in an orderly manner. Moreover, the suspicious bad information sent to the terminal has already been screened by speech recognition, which greatly reduces the workload of traditional manual monitoring and turns blind, passive monitoring into active, targeted searching.
  • To further improve the efficiency of audio monitoring and verification, the bad keywords in the text can be highlighted; the text can also be displayed in segments according to the audio segments, that is, the text corresponding to one audio segment is displayed as one paragraph.
  • When the monitoring checker clicks on a section of text, the manual monitoring and verification terminal immediately plays the audio corresponding to that section of text.
  • During transmission, audio segments are large and texts are small, so the text often reaches the manual monitoring and verification terminal earlier than the audio segment. Since the two do not arrive at the same time, the terminal needs a way to know which piece of text corresponds to which piece of audio.
  • In this embodiment, this problem is solved by having the monitoring device number each piece of audio and text.
  • Ideally, one piece of audio corresponds to one piece of text and the two can be matched in order, but a piece of audio may have no corresponding text, for example when a song is played live.
  • The solution is that if an audio segment has no corresponding text, the monitoring device marks it in the log file, and the manual monitoring and verification terminal maps the audio segments to the texts one-to-one according to the log file. When an audio segment carries the mark, the terminal skips it, which avoids mismatches between text and audio segments.
  • The monitoring device learns which audio segment has no corresponding text from the data returned by the speech recognition server. For example, one or more of the start time, end time, and audio number are merged into characteristic information that is sent to the speech recognition server together with the audio segment; the server returns text carrying that characteristic information, so the monitoring device knows whether text was returned for that segment.
  • the implementation method is not limited to this.
  • Speech recognition technology converts the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It can be regarded as a mechanical conversion of audio into text, so semantic errors are inevitable, which reduces the accuracy of bad information retrieval. Natural language processing technology studies how humans and computers can communicate effectively in natural language, and using it to correct the text generated by the speech recognition server can make up for this defect of speech recognition.
  • If the text output by the speech recognition server is defined as the primary text, the natural language processing server automatically corrects the primary text according to natural language and returns the corrected secondary text to the monitoring device;
  • the monitoring device searches the secondary text for bad keywords, and sends the secondary text that matches the bad keywords, together with its corresponding audio segments and log files, to the manual monitoring and verification terminal.
  • the manual monitoring and verification terminal maps the audio segments to the secondary texts one-to-one according to the log file and displays them for manual verification.
  • This also improves the accuracy of bad information retrieval and reduces the workload of manual verification.
  • semantic understanding technology is also developing rapidly, and smart speakers based on semantic understanding technology have also become one of the hotter products.
  • Using semantic understanding technology to filter the secondary text again can further remove some "false" bad information, thereby further reducing the workload of manual verification.
  • The identification of text, pictures, and videos can be based on existing technology and is not repeated here; the only addition is that the system's intelligent identification is followed by manual verification to improve accuracy.
  • the monitoring equipment cyclically and concurrently processes all the websites in a certain area, and during concurrent processing it deduplicates identical information that comes from the same webpage of the same website.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

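A website bad information monitoring system and a monitoring method thereof. The monitoring system comprises a monitoring device, a speech recognition server that interacts with the monitoring device, and a manual monitoring and verification terminal. The monitoring system brings audio information and the audio data in video information into the scope of website bad information monitoring. The existing means of monitoring audio data are mainly manual monitoring and speech recognition: manual monitoring involves a heavy workload, and speech recognition has low accuracy. By organically combining these two means and overcoming the difficulties that arise in combining them, the monitoring system greatly reduces the manual monitoring workload while maintaining monitoring accuracy, and therefore has good value for wider adoption. The same combination of intelligent identification and manual verification can also be applied to text and pictures (including screenshots of video frames), improving the accuracy with which bad information on websites is identified.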

Description

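Website bad information monitoring system and monitoring method thereof
Technical Field
The present invention relates to the technical field of network information security, and in particular to a website bad information monitoring system and a monitoring method thereof.
Background Art
The virtual online world mirrors the real world: the real world has beauty and ugliness, good and evil, and so does the online world. When the Internet first emerged, people went online mainly to look up materials and information, and bad information was still very limited. As the Internet developed, people began to seek entertainment, look for business opportunities, and read news online, and bad information gradually began to spread. In recent years, bad information has even grown into an industry and has begun to shift from purely "knowledge-based" information to "profit-making" information, with varied methods and complex forms. Much of it violates laws and morals; pornographic content predominates, mixed with illegal content such as gambling, fraud, and firearms trafficking. Cracking down on websites that publish bad information has always been an important duty of the relevant authorities in China. How to quickly pick out, from an enormous number of websites, those that contain bad information is one of the main research directions in network information security.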
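Technical Problem
Invention patent application CN107547555A, published by the State Intellectual Property Office on January 5, 2018, discloses a website security monitoring method that classifies and examines the extracted webpage content information as text, pictures, and videos, and sets multiple thresholds to prevent misjudgment. This monitoring scheme has two major problems. First, it ignores the role of audio data in identifying bad information. Second, preventing misjudgment with thresholds still leaves room for misjudgment: a website that contains no bad information may be judged a bad information website, and a bad information website may be judged a regular website. In short, the monitoring accuracy is insufficient.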
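Technical Solution
In view of the above problems, the present invention provides a website bad information monitoring system and a monitoring method thereof.
The present invention protects a website bad information monitoring system comprising a monitoring device, a speech recognition server that interacts with the monitoring device, and a manual monitoring and verification terminal.
The monitoring device obtains the webpage content information of each webpage of the target website; the webpage content information includes at least one of text, picture, audio, and video. For the video information contained in the webpages of the website, the monitoring device extracts the audio data as video-audio information.
The monitoring device cuts the audio streams of the video-audio information and audio information contained in the webpages of the website at natural sentence boundaries, and sends the cut audio segments to the speech recognition server in order.
The speech recognition server converts the audio segment content into text and returns it to the monitoring device. The monitoring device performs a bad keyword search on the text and sends the text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal. The manual monitoring and verification terminal maps the audio segments to the texts one-to-one according to the log file and displays them for manual verification. The log file includes, but is not limited to, the source webpage link, the start time of the audio segment, the end time of the audio segment, the audio code corresponding to the audio segment, and the text corresponding to the audio segment.
The monitoring device examines the text, pictures, and videos contained in the webpages of the website; if it identifies bad information, it sends the link of the webpage containing the bad information to the manual monitoring and verification terminal for manual verification.
Further, the monitoring device interacts with a natural language processing server. The speech recognition server converts the audio segment content into a primary text and returns it to the monitoring device; the monitoring device then sends the primary text to the natural language processing server; the natural language processing server automatically corrects the primary text according to natural language and returns the corrected secondary text to the monitoring device.
The monitoring device performs a bad keyword search on the secondary text and sends the secondary text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal, which maps the audio segments to the secondary texts one-to-one according to the log file and displays them for manual verification.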
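The present invention also protects a monitoring method for the above website bad information monitoring system, comprising the following steps:
1. The monitoring device uses a web crawler to obtain the webpage content information of each webpage of the target website; the webpage content information includes at least one of text, picture, audio, and video.
2. For the video information contained in the webpages of the website, the monitoring device extracts the audio data as video-audio information.
3. The monitoring device cuts the audio streams of the video-audio information and audio information contained in the webpages of the website at natural sentence boundaries, and sends the cut audio segments to the speech recognition server in order.
4. The speech recognition server converts the audio segment content into text and returns it to the monitoring device; the monitoring device then sends this primary text returned by the speech recognition server to the natural language processing server.
5. The natural language processing server automatically corrects the primary text according to natural language and returns the corrected secondary text to the monitoring device.
6. The monitoring device performs a bad keyword search on the secondary text and sends the secondary text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal.
7. The manual monitoring and verification terminal maps the audio segments to the secondary texts one-to-one according to the log file and displays them for manual verification. The log file includes, but is not limited to, the source webpage link, the start time of the audio segment, the end time of the audio segment, the audio code corresponding to the audio segment, and the text corresponding to the audio segment.
8. The monitoring device examines the text, pictures, and videos contained in the webpages of the website; if it identifies bad information, it sends the link of the webpage containing the bad information to the manual monitoring and verification terminal for manual verification.
9. During manual verification, it is first determined whether the bad information is a pure misjudgment. If it is not, the nature of the website is classified according to the bad information found: either a regular website that contains a small amount of bad information, or a website that is itself a bad website.
Preferably, after the monitoring device performs the bad keyword search on the secondary text, it analyzes the semantics of each sentence containing a bad keyword through semantic understanding; only after the sentence is judged to be bad information are the secondary text, its corresponding audio segment, and the log file sent to the manual monitoring and verification terminal.
Preferably, the monitoring device numbers each segment of audio and text; if an audio segment has no corresponding text, the monitoring device marks it in the log file.
Preferably, the duration of an audio segment is limited to 60 s or less.
Preferably, the bad keywords in the text are highlighted.
Preferably, the monitoring device cyclically and concurrently processes all websites in a certain area.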
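Steps 4 to 6 above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: the patent does not specify any programming interface, so `Flagged`, `monitor_audio`, and the `recognize` and `correct` callables (stand-ins for the speech recognition server and the natural language processing server) are hypothetical names, and the keyword search is shown as plain substring matching.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Flagged:
    """One audio segment whose secondary text matched a bad keyword (hypothetical type)."""
    segment_no: int       # number assigned by the monitoring device
    secondary_text: str   # text after natural-language correction
    keywords: List[str]   # bad keywords found in the secondary text


def monitor_audio(
    segments: Iterable[bytes],
    recognize: Callable[[bytes], str],  # assumed stand-in for the speech recognition server
    correct: Callable[[str], str],      # assumed stand-in for the NLP server
    bad_keywords: Iterable[str],
) -> List[Flagged]:
    """Run steps 4-6 on already-cut audio segments and collect the matches."""
    keywords = list(bad_keywords)
    flagged = []
    for number, audio in enumerate(segments, start=1):
        primary = recognize(audio)      # step 4: primary text
        secondary = correct(primary)    # step 5: secondary text
        hits = [k for k in keywords if k in secondary]  # step 6: keyword search
        if hits:
            flagged.append(Flagged(number, secondary, hits))
    return flagged
```

In a real deployment the two callables would wrap network requests to the respective servers; only the flagged items would then be forwarded to the manual monitoring and verification terminal.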
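Beneficial Effects
The present invention also brings audio information and the audio data in video information into the scope of website bad information monitoring. The existing means of monitoring audio data are mainly manual monitoring and speech recognition: manual monitoring involves a heavy workload, and speech recognition has low accuracy. By organically combining these two means and overcoming the difficulties that arise in combining them, the present invention greatly reduces the manual monitoring workload while maintaining monitoring accuracy, and therefore has good value for wider adoption. The same combination of intelligent identification by the system and manual verification can also be used to verify text and pictures (including screenshots of video frames), improving the accuracy of identifying bad information on websites.
Brief Description of the Drawings
Figure 1 is a structural block diagram of the website bad information monitoring system.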
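Embodiments of the Invention
The present invention is described in further detail below with reference to the drawing and specific embodiments. The embodiments are given for the sake of example and description; they are not exhaustive and do not limit the invention to the forms disclosed. Many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were chosen and described to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and design various embodiments, with various modifications, suited to particular uses.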
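Embodiment
A website bad information monitoring system, as shown in Figure 1, comprises a monitoring device, a speech recognition server that interacts with the monitoring device, and a manual monitoring and verification terminal. The specific monitoring method comprises the following steps:
1. The monitoring device uses a web crawler to obtain the webpage content information of each webpage of the target website; the webpage content information includes at least one of text, picture, audio, and video.
2. For the video information contained in the webpages of the website, the monitoring device extracts the audio data as video-audio information.
Audio information and the audio data in video information also carry a large amount of information and should not be ignored in bad information monitoring. Ignoring them would leave a huge loophole in the monitoring system that lawbreakers could exploit, causing a large amount of bad information to go unmonitored.
3. The monitoring device cuts the audio streams of the video-audio information and audio information contained in the webpages of the website at natural sentence boundaries, and sends the cut audio segments to the speech recognition server in order.
People pause when they speak normally. Cutting the audio stream at natural sentence boundaries has two benefits: first, it preserves the integrity of the audio information and prevents audio data loss; second, it reduces the bandwidth occupied during transmission, so that audio reaches the speech recognition server quickly instead of getting stuck on the way because of network congestion. On a congested road, bicycles, electric scooters, and especially pedestrians can weave through the gaps between cars; network transmission works the same way.
If no sufficiently long pause is detected within 60 s, the audio stream is forcibly cut to avoid excessively long audio segments, which would slow down both the transmission of the segment and the response of the speech recognition server; this keeps the system timely.
In addition, once the audio stream is cut into an audio segment, the segment is independent of the stream still being generated. The segment has ended, so it can be played back, which facilitates manual monitoring and verification.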
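The cutting rule in step 3 (cut at a sufficiently long pause, otherwise force a cut at 60 s) can be sketched as follows. This is a minimal sketch, not the patent's implementation: it assumes that a voice activity detector has already produced one boolean flag per fixed-length frame, and the function name `cut_segments` and its parameters `frame_s`, `min_pause_s`, and `max_len_s` are hypothetical. As a further assumption, stretches that contain no voiced frame at all are dropped rather than emitted as segments.

```python
from typing import List, Sequence, Tuple


def cut_segments(
    voiced: Sequence[bool],
    frame_s: float = 0.1,      # assumed frame length in seconds
    min_pause_s: float = 0.5,  # assumed pause length that ends a natural sentence
    max_len_s: float = 60.0,   # forced cut after this many seconds
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, one per audio segment."""
    pause_frames = max(1, round(min_pause_s / frame_s))
    max_frames = max(1, round(max_len_s / frame_s))
    segments = []
    start = 0
    silent_run = 0
    has_voice = False
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            silent_run = 0
            has_voice = True
        else:
            silent_run += 1
        end = i + 1
        # Cut on a long enough pause, or force a cut when the segment gets too long.
        if silent_run >= pause_frames or end - start >= max_frames:
            if has_voice:
                segments.append((start, end))
            start, silent_run, has_voice = end, 0, False
    if has_voice:
        segments.append((start, len(voiced)))  # trailing speech without a final pause
    return segments
```

With one-second frames, a two-second pause threshold, and a five-second limit, `cut_segments([True, True, False, False, True, True, True, True, True, True, False], 1.0, 2.0, 5.0)` yields one cut at the pause, one forced cut, and a final remainder.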
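4. The speech recognition server is an existing third-party server. It converts the audio segment content into text and returns it to the monitoring device.
5. The monitoring device performs a bad keyword search on the text and sends the text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal. The log file includes, but is not limited to, the source webpage link, the start time of the audio segment, the end time of the audio segment, the audio code corresponding to the audio segment, and the text corresponding to the audio segment.
The start time and end time of an audio segment are given in Beijing time. The start time, the end time, and the corresponding audio code are information the monitoring device can obtain during audio cutting; the text corresponding to the audio segment is the text returned by the speech recognition server.
As for the bad keyword database, it comes from years of accumulated experience in cracking down on illegal network information. A bad keyword deep learning model can also be built from the existing bad keyword database and deep learning technology to improve the speed and accuracy of detecting bad information in text. Bad information in pictures (including screenshots of video frames) can likewise be identified with image recognition models based on deep learning.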
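A log record with the fields listed in step 5, combined with a simple keyword search, might look like the sketch below. The field and function names (`LogEntry`, `find_bad_keywords`, `build_entry`) are hypothetical, the `matched` field is an addition for illustration that the patent does not list, and substring matching is assumed in place of whatever matching the keyword database actually uses.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class LogEntry:
    """Log record for one audio segment; field names are illustrative."""
    source_url: str   # source webpage link
    start_time: str   # start of the audio segment (Beijing time)
    end_time: str     # end of the audio segment (Beijing time)
    audio_code: str   # audio code corresponding to the segment
    text: str         # text returned for the segment
    matched: List[str] = field(default_factory=list)  # extra field, not in the patent


def find_bad_keywords(text: str, keywords: Iterable[str]) -> List[str]:
    """Return the keywords that occur in the text (plain substring match)."""
    return [k for k in keywords if k and k in text]


def build_entry(source_url: str, start_time: str, end_time: str,
                audio_code: str, text: str,
                keywords: Iterable[str]) -> Optional[LogEntry]:
    """Create a log entry only when the text matches at least one bad keyword."""
    hits = find_bad_keywords(text, keywords)
    if not hits:
        return None  # nothing to send to the verification terminal
    return LogEntry(source_url, start_time, end_time, audio_code, text, hits)
```

Returning `None` for clean segments mirrors the text's point that only matching text and its audio are forwarded for manual verification.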
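6. The manual monitoring and verification terminal maps the audio segments to the texts one-to-one according to the log file and displays them for manual verification.
7. The monitoring device examines the text, pictures, and videos contained in the webpages of the website; if it identifies bad information, it sends the link of the webpage containing the bad information to the manual monitoring and verification terminal for manual verification.
The manual monitoring and verification terminal can connect to the monitoring device over a network, which removes local-area restrictions: a monitoring and verification office set up at one fixed location can check and monitor the monitoring devices across the country in an orderly manner. Moreover, the suspicious bad information sent to the terminal has already been screened by speech recognition, which greatly reduces the workload of traditional manual monitoring and turns blind, passive monitoring into active, targeted searching.
To further improve the efficiency of audio monitoring and verification, the bad keywords in the text can be highlighted. The text can also be displayed in segments according to the audio segments, that is, the text corresponding to one audio segment is displayed as one paragraph. When the monitoring checker clicks on a section of text, the manual monitoring and verification terminal immediately plays the audio corresponding to that section of text.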
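Keyword highlighting on the verification terminal can be illustrated with a small helper. This is a sketch under assumptions: the patent does not say how highlighting is rendered, so the `[[` and `]]` markers and the function name `highlight` are placeholders for whatever the display layer uses.

```python
import re
from typing import Iterable


def highlight(text: str, keywords: Iterable[str],
              left: str = "[[", right: str = "]]") -> str:
    """Wrap every occurrence of a bad keyword in marker strings."""
    # Longest keywords first, so that a longer keyword wins over its own prefix.
    words = sorted({k for k in keywords if k}, key=len, reverse=True)
    if not words:
        return text
    pattern = re.compile("|".join(re.escape(w) for w in words))
    return pattern.sub(lambda m: f"{left}{m.group(0)}{right}", text)
```

Escaping each keyword with `re.escape` keeps keywords that contain regular-expression metacharacters from being misread as patterns.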
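During transmission, audio segments are large and texts are small, so the text often reaches the manual monitoring and verification terminal earlier than the audio segment. Since the two do not arrive at the same time, the terminal needs a way to know which piece of text corresponds to which piece of audio. In this embodiment, this problem is solved by having the monitoring device number each piece of audio and text.
Ideally, one piece of audio corresponds to one piece of text and the two can be matched in order, but a piece of audio may have no corresponding text, for example when a song is played live. This raises the question of how to match the texts returned by the speech recognition server to the audio segments one-to-one. In this embodiment, if an audio segment has no corresponding text, the monitoring device marks it in the log file, and the manual monitoring and verification terminal maps the audio segments to the texts one-to-one according to the log file; when an audio segment carries the mark, the terminal skips it, which avoids mismatches between text and audio segments. The monitoring device learns which audio segment has no corresponding text from the data returned by the speech recognition server. For example, one or more of the start time, end time, and audio number are merged into characteristic information that is sent to the speech recognition server together with the audio segment; the server returns text carrying that characteristic information, so the monitoring device knows whether text was returned for that segment. Of course, the implementation is not limited to this.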
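The numbering and marking scheme can be sketched on the terminal side as follows. The data shapes are assumptions made for illustration: the log is modeled as a list of `(segment number, has_text)` pairs, and texts and audio are modeled as dictionaries keyed by segment number so that arrival order does not matter. The name `pair_segments` is hypothetical.

```python
from typing import Dict, List, Tuple


def pair_segments(
    log: List[Tuple[int, bool]],  # (segment number, has_text flag) in playback order
    texts: Dict[int, str],        # segment number -> text, possibly received first
    audio: Dict[int, bytes],      # segment number -> audio data
) -> List[Tuple[int, bytes, str]]:
    """Pair each audio segment with its text, skipping segments marked as having no text."""
    pairs = []
    for number, has_text in log:
        if not has_text:
            continue  # marked in the log file, for example a song: skip it
        if number in texts and number in audio:
            pairs.append((number, audio[number], texts[number]))
    return pairs
```

Because pairing is driven by the segment number rather than by arrival order, a text that arrives before its audio simply waits in `texts` until the audio is present.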
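Speech recognition technology converts the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It can be regarded as a mechanical conversion of audio into text, so semantic errors are inevitable, which reduces the accuracy of bad information retrieval. Natural language processing technology studies how humans and computers can communicate effectively in natural language, and using it to correct the text generated by the speech recognition server can make up for this defect of speech recognition.
In other words, if the text output by the speech recognition server is defined as the primary text, the natural language processing server automatically corrects the primary text according to natural language and returns the corrected secondary text to the monitoring device. The monitoring device performs a bad keyword search on the secondary text and sends the secondary text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal, which maps the audio segments to the secondary texts one-to-one according to the log file and displays them for manual verification. This also improves the accuracy of bad information retrieval and reduces the workload of manual verification.
In addition, semantic understanding technology is developing rapidly, and smart speakers based on it have become one of today's more popular products. Using semantic understanding to filter the secondary text once more can remove some "false" bad information and thereby further reduce the workload of manual verification.
The identification of text, pictures, and videos can be based on existing technology and is not repeated here; the only addition is that the system's intelligent identification is followed by manual verification to improve accuracy.
8. During manual verification, it is first determined whether the bad information is a pure misjudgment. If it is not, the nature of the website is classified according to the bad information found: either a regular website that contains a small amount of bad information, or a website that is itself a bad website. In the former case, the website is ordered to rectify; in the latter, it is dealt with severely.
To improve the monitoring efficiency of the system, the monitoring device cyclically and concurrently processes all websites in a certain area, and during concurrent processing it deduplicates identical information that comes from the same webpage of the same website.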
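Concurrent processing with deduplication can be sketched with Python's standard thread pool. This is an illustrative sketch, not the patent's design: identical information is assumed to mean identical bytes from the same site and page, detected through a SHA-256 digest, and `SeenItems`, `process_all`, and `handle` are hypothetical names. Only one pass over the items is shown; the cyclic repetition described in the text would call `process_all` again on each cycle.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Tuple

Item = Tuple[str, str, bytes]  # (website, webpage, content)


class SeenItems:
    """Thread-safe record of (website, webpage, content digest) triples."""

    def __init__(self) -> None:
        self._seen = set()
        self._lock = threading.Lock()

    def first_time(self, site: str, page: str, content: bytes) -> bool:
        key = (site, page, hashlib.sha256(content).hexdigest())
        with self._lock:  # check and insert atomically across worker threads
            if key in self._seen:
                return False
            self._seen.add(key)
            return True


def process_all(items: Iterable[Item],
                handle: Callable[[str, str, bytes], None],
                workers: int = 4) -> int:
    """Handle each distinct item once, concurrently; return how many were handled."""
    seen = SeenItems()

    def task(item: Item) -> bool:
        site, page, content = item
        if seen.first_time(site, page, content):
            handle(site, page, content)
            return True
        return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(task, items))
```

The lock matters because two workers could otherwise both see the same item as new and process it twice.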
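Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained from them by those of ordinary skill in this and related fields without creative effort shall fall within the scope of protection of the present invention.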

Claims (10)

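  1. A website bad information monitoring system, characterized in that it comprises a monitoring device, a speech recognition server that interacts with the monitoring device, and a manual monitoring and verification terminal;
    the monitoring device obtains the webpage content information of each webpage of the target website, the webpage content information including at least one of text, picture, audio, and video; for the video information contained in the webpages of the website, the monitoring device extracts the audio data as video-audio information;
    the monitoring device cuts the audio streams of the video-audio information and audio information contained in the webpages of the website at natural sentence boundaries, and sends the cut audio segments to the speech recognition server in order;
    the speech recognition server converts the audio segment content into text and returns it to the monitoring device; the monitoring device performs a bad keyword search on the text and sends the text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal; the manual monitoring and verification terminal maps the audio segments to the texts one-to-one according to the log file and displays them for manual verification;
    the monitoring device examines the text, pictures, and videos contained in the webpages of the website, and if it identifies bad information, sends the link of the webpage containing the bad information to the manual monitoring and verification terminal for manual verification.
  2. The website bad information monitoring system according to claim 1, characterized in that the log file includes, but is not limited to, the source webpage link, the start time of the audio segment, the end time of the audio segment, the audio code corresponding to the audio segment, and the text corresponding to the audio segment.
  3. The website bad information monitoring system according to claim 1, characterized in that the monitoring device interacts with a natural language processing server; the speech recognition server converts the audio segment content into a primary text and returns it to the monitoring device; the monitoring device then sends the primary text returned by the speech recognition server to the natural language processing server; the natural language processing server automatically corrects the primary text according to natural language and returns the corrected secondary text to the monitoring device;
    the monitoring device performs a bad keyword search on the secondary text and sends the secondary text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal; the manual monitoring and verification terminal maps the audio segments to the secondary texts one-to-one according to the log file and displays them for manual verification.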
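  4. A monitoring method for the website bad information monitoring system according to claim 3, characterized in that it comprises the following steps:
    Step 1: the monitoring device uses a web crawler to obtain the webpage content information of each webpage of the target website, the webpage content information including at least one of text, picture, audio, and video;
    Step 2: for the video information contained in the webpages of the website, the monitoring device extracts the audio data as video-audio information;
    Step 3: the monitoring device cuts the audio streams of the video-audio information and audio information contained in the webpages of the website at natural sentence boundaries, and sends the cut audio segments to the speech recognition server in order;
    Step 4: the speech recognition server converts the audio segment content into a primary text and returns it to the monitoring device, and the monitoring device then sends the primary text returned by the speech recognition server to the natural language processing server;
    Step 5: the natural language processing server automatically corrects the primary text according to natural language and returns the corrected secondary text to the monitoring device;
    Step 6: the monitoring device performs a bad keyword search on the secondary text and sends the secondary text matching the bad keywords, together with its corresponding audio segment and log file, to the manual monitoring and verification terminal;
    Step 7: the manual monitoring and verification terminal maps the audio segments to the secondary texts one-to-one according to the log file and displays them for manual verification;
    Step 8: the monitoring device examines the text, pictures, and videos contained in the webpages of the website, and if it identifies bad information, sends the link of the webpage containing the bad information to the manual monitoring and verification terminal for manual verification;
    Step 9: during manual verification, it is first determined whether the bad information is a pure misjudgment; if it is not, the nature of the website is classified according to the bad information found, determining whether it is a regular website that contains a small amount of bad information or is itself a bad website.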
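  5. The monitoring method according to claim 4, characterized in that the log file includes, but is not limited to, the source webpage link, the start time of the audio segment, the end time of the audio segment, the audio code corresponding to the audio segment, and the text corresponding to the audio segment.
  6. The monitoring method according to claim 5, characterized in that, after the monitoring device performs the bad keyword search on the secondary text, it analyzes the semantics of each sentence containing a bad keyword through semantic understanding, and only after the sentence is judged to be bad information are the secondary text, its corresponding audio segment, and the log file sent to the manual monitoring and verification terminal.
  7. The monitoring method according to claim 6, characterized in that the monitoring device numbers each segment of audio and text; if an audio segment has no corresponding text, the monitoring device marks it in the log file.
  8. The monitoring method according to claim 6, characterized in that the duration of an audio segment is limited to 60 s or less.
  9. The monitoring method according to claim 6, characterized in that the bad keywords in the text are highlighted.
  10. The monitoring method according to claim 6, characterized in that the monitoring device cyclically and concurrently processes all websites in a certain area.
PCT/CN2020/132692 2020-03-05 2020-11-30 Website bad information monitoring system and monitoring method thereof WO2021174926A1 (zh)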

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010146566.0 2020-03-05
CN202010146566.0A CN111383660B (zh) 2020-03-05 2020-03-05 Website bad information monitoring system and monitoring method thereof

Publications (1)

Publication Number Publication Date
WO2021174926A1 (zh) 2021-09-10

Family

ID=71218692

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132692 WO2021174926A1 (zh) 2020-03-05 2020-11-30 Website bad information monitoring system and monitoring method thereof

Country Status (2)

Country Link
CN (1) CN111383660B (zh)
WO (1) WO2021174926A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111383660B (zh) * 2020-03-05 2023-07-14 安徽声讯信息技术有限公司 Website bad information monitoring system and monitoring method thereof
CN113516997A (zh) * 2021-04-26 2021-10-19 常州分音塔科技有限公司 Voice event recognition device and method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070234398A1 (en) * 2006-03-02 2007-10-04 Thomas Muehlbauer Controlling Access to Digital Media Content
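CN104754374A (zh) * 2015-04-03 2015-07-01 北京奇虎科技有限公司 Audio and video file detection management method and device
CN106250837A (zh) * 2016-07-27 2016-12-21 腾讯科技(深圳)有限公司 Video recognition method, device and system
CN108806668A (zh) * 2018-06-08 2018-11-13 国家计算机网络与信息安全管理中心 Audio and video multi-dimensional annotation and model optimization method
CN109508402A (zh) * 2018-11-15 2019-03-22 上海指旺信息科技有限公司 Method and device for detecting prohibited terms
CN110085213A (zh) * 2019-04-30 2019-08-02 广州虎牙信息科技有限公司 Audio anomaly monitoring method, device, equipment and storage medium
CN110598075A (zh) * 2019-08-21 2019-12-20 成都信息工程大学 Artificial intelligence-based Internet media content security monitoring system and method
CN110837615A (zh) * 2019-11-05 2020-02-25 福建省趋普物联科技有限公司 Artificial intelligence review system for filtering advertising content information
CN111383660A (zh) * 2020-03-05 2020-07-07 安徽声讯信息技术有限公司 Website bad information monitoring system and monitoring method thereof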

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050106792A (ko) * 2004-05-06 2005-11-11 권창희 Customer-tailored voice service system
US8880107B2 (en) * 2011-01-28 2014-11-04 Protext Mobility, Inc. Systems and methods for monitoring communications
CN106888194A (zh) * 2015-12-16 2017-06-23 国家电网公司 Smart grid IT asset security monitoring system based on distributed scheduling
CN106100777B (zh) * 2016-05-27 2018-08-17 西华大学 Broadcast assurance method based on speech recognition technology
CN110287315A (zh) * 2019-05-27 2019-09-27 厦门快商通信息咨询有限公司 Public opinion determination method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111383660A (zh) 2020-07-07
CN111383660B (zh) 2023-07-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20923645

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20923645

Country of ref document: EP

Kind code of ref document: A1