CN112000929A - Cross-platform data analysis method, system, equipment and readable storage medium - Google Patents

Cross-platform data analysis method, system, equipment and readable storage medium

Publication number
CN112000929A
Authority
CN
China
Prior art keywords
analysis
data
cross
module
platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010746899.7A
Other languages
Chinese (zh)
Inventor
吴小坤
朱鸿军
赵甜芳
谷刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhicheng Technology Co ltd
Original Assignee
Guangzhou Zhicheng Technology Co ltd
Application filed by Guangzhou Zhicheng Technology Co ltd
Priority to CN202010746899.7A
Publication of CN112000929A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Computer Security & Cryptography (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a cross-platform data analysis method, system, device and readable storage medium. The method comprises the following steps: a cross-platform data analysis system, in which a text retrieval module and a text analysis module are embedded, receives target analysis content sent by user equipment; keywords are retrieved according to user requirements and retrieval data are extracted; feature mining and intelligent analysis are performed on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result; and whether an infringement condition is met is judged according to the analysis result and, if so, an infringement report and a comparative analysis result are generated for download by the user equipment. Through big data acquisition and analysis and an improved text matching technique, the method can retrieve and collect infringing content in the new media era and provides an effective solution to the problem of copyright protection for network content.

Description

Cross-platform data analysis method, system, equipment and readable storage medium
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a cross-platform data analysis method, a cross-platform data analysis system, a cross-platform data analysis device and a readable storage medium.
Background
Existing copyright protection methods fall into three types: manual retrieval, rights-protection outsourcing and platform self-inspection.
The traditional "manual retrieval" method suits individual media creators or small teams: content titles or keywords are entered into a search engine or a public platform to retrieve similar content, and plagiarism is judged manually. Its disadvantages are obvious. First, it is inefficient and depends heavily on manual labor, so it is unsuitable for teams with high output, few staff and limited funds. Second, much infringing content slips through the net: because of search engine indexing rules, some content is not indexed in time and cannot be retrieved directly by a search engine. Third, evidence is lacking: obvious copying can be spotted at a glance, but the source of content that has been reorganized and processed to evade duplicate detection is often ambiguous and hard for an inexperienced team to establish.
"Rights-protection outsourcing" refers to the online copyright protection products that have emerged in recent years, a typical representative being Rights Knight. It initially relied on the complaint-and-takedown mechanism of the WeChat platform to combat plagiarizing official accounts, and later developed into a copyright protection platform with whole-network monitoring. However, the technical limitation of such rights-protection services is that only large-scale copying and reposting can be detected; "manuscript washing" (plagiarism disguised by rewording and restructuring) cannot be identified, so original content cannot be effectively protected.
"Platform self-inspection" is a method introduced by social media platforms such as WeChat official accounts and Zhihu. It filters out plagiarized content through detection before publication. Its advantage is that it stops irregular reposting and plagiarism at the source and safeguards the quality of articles across the platform; its disadvantages are that it cannot identify manuscript washing articles and that it is limited to the platform itself. In addition, "reader complaints" are also a form of platform self-inspection, with the advantages of saving manpower and fully mobilizing the enthusiasm of the readership. The disadvantage is that the outcome of a complaint is strongly influenced by the subjectivity of readers, and setting evaluation criteria that are fair to both readers and writers is a difficult problem for the platform.
Disclosure of Invention
Therefore, the embodiments of the application provide a cross-platform data analysis method, system, device and readable storage medium. Through big data acquisition and analysis and an improved text matching technique, the retrieval and collection of infringing content in the new media era can be completed, providing an effective solution to the problem of copyright protection for network content.
In order to achieve the above object, the embodiments of the present application provide the following technical solutions:
according to a first aspect of embodiments of the present application, there is provided a cross-platform data analysis method, the method including:
a cross-platform data analysis system receives target analysis content sent by user equipment, and a text retrieval module and a text analysis module are embedded into the cross-platform data analysis system;
retrieving keywords according to user requirements, and extracting retrieval data;
performing feature mining and intelligent analysis on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result;
and judging whether an infringement condition is met according to the analysis result, and if so, generating an infringement report and a comparative analysis result for download by the user equipment.
Optionally, the retrieving keywords according to user requirements and extracting retrieval data includes:
providing a dialog box in the cross-platform data analysis system, so that a user can enter the content keywords to be retrieved as required and exclude redundant information;
obtaining related content on an internet platform within a specific time period, wherein the related content belongs to the user account in a database of the cross-platform data analysis system;
and retrieving the keywords according to the user requirements, and extracting the retrieval data.
Optionally, the performing feature mining and intelligent analysis on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result includes:
shortening the mapping to the paragraph level, adding a semantic distribution matching algorithm on top of literal word matching, and performing similarity analysis on the retrieval data in combination with causal reasoning, keyword extraction and summary extraction algorithms to obtain an analysis result.
Optionally, the judging whether an infringement condition is met according to the analysis result, and if so, generating an infringement report and a comparative analysis result, includes:
comparing the retrieved articles suspected of manuscript washing or plagiarism with the target analysis content, and generating a sample of infringement litigation materials and a data analysis report when the infringement condition is met.
According to a second aspect of embodiments of the present application, there is provided a cross-platform data analysis system, the system comprising:
the information receiving module is used for receiving target analysis content sent by user equipment, wherein a text retrieval module and a text analysis module are embedded in the cross-platform data analysis system;
the retrieval module is used for retrieving keywords according to user requirements and extracting retrieval data;
the analysis module is used for performing feature mining and intelligent analysis on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result;
and the result judgment module is used for judging whether an infringement condition is met according to the analysis result, and if so, generating an infringement report and a comparative analysis result for download by the user equipment.
Optionally, the retrieval module is specifically configured to perform:
providing a dialog box in the cross-platform data analysis system, so that a user can enter the content keywords to be retrieved as required and exclude redundant information;
obtaining related content on an internet platform within a specific time period, wherein the related content belongs to the user account in a database of the cross-platform data analysis system;
and retrieving the keywords according to the user requirements, and extracting the retrieval data.
Optionally, the analysis module is specifically configured to perform:
shortening the mapping to the paragraph level, adding a semantic distribution matching algorithm on top of literal word matching, and performing similarity analysis on the retrieval data in combination with causal reasoning, keyword extraction and summary extraction algorithms to obtain an analysis result.
Optionally, the result judgment module is specifically configured to perform:
comparing the retrieved articles suspected of manuscript washing or plagiarism with the target analysis content, and generating a sample of infringement litigation materials and a data analysis report when the infringement condition is met.
According to a third aspect of embodiments herein, there is provided an apparatus comprising a data acquisition device, a processor and a memory; the data acquisition device is used for acquiring data; the memory is used for storing one or more program instructions; and the processor is configured to execute the one or more program instructions to perform the method of any of the first aspect.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium having one or more program instructions embodied therein for performing the method of any of the first aspect.
To sum up, the embodiments of the present application provide a cross-platform data analysis method, system, device and readable storage medium, wherein a cross-platform data analysis system, in which a text retrieval module and a text analysis module are embedded, receives target analysis content sent by user equipment; keywords are retrieved according to user requirements and retrieval data are extracted; feature mining and intelligent analysis are performed on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result; and whether an infringement condition is met is judged according to the analysis result and, if so, an infringement report and a comparative analysis result are generated for download by the user equipment. Through big data acquisition and analysis and an improved text matching technique, the method can retrieve and collect infringing content in the new media era and provides an effective solution to the problem of copyright protection for network content.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
The structures, proportions, sizes and the like shown in this specification are used only to accompany the content disclosed in the specification so that those skilled in the art can understand and read it. They do not limit the conditions under which the present invention can be implemented and therefore have no substantive technical significance; any structural modification, change in proportion or adjustment of size that does not affect the functions and purposes of the present invention should still fall within the scope of the present invention.
Fig. 1 is a schematic flowchart of a cross-platform data analysis method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an embodiment of a cross-platform data analysis method provided in an embodiment of the present application;
Fig. 3 is a block diagram of a cross-platform data analysis system according to an embodiment of the present application.
Detailed Description
The present invention is described below by way of particular embodiments; other advantages and features of the invention will become apparent to those skilled in the art from this disclosure. The described embodiments are merely exemplary of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
To address the problems in the prior art, the embodiments of the application provide a cross-platform data analysis method. First, whole-network data acquisition and analysis ensure coverage of related content and cross-platform content tracking. Second, building on natural language processing techniques such as keyword extraction, causal reasoning and summary extraction, the original text matching is improved into more intelligent paragraph matching and even article matching, so that traces of plagiarism and manuscript washing can be found; the matching results are used to score the degree of plagiarism and the range of influence, achieving automatic and intelligent copyright protection. In addition, a mechanism that automatically generates and sends litigation documents eases the difficulty that small and medium-sized teams face in protecting their rights and reduces the labor burden on the teams of large enterprises.
As shown in fig. 1, the method comprises the steps of:
Step 101: the cross-platform data analysis system receives target analysis content sent by user equipment; a text retrieval module and a text analysis module are embedded in the cross-platform data analysis system.
Step 102: keywords are retrieved according to user requirements, and retrieval data are extracted.
Step 103: feature mining and intelligent analysis are performed on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result.
Step 104: whether an infringement condition is met is judged according to the analysis result, and if so, an infringement report and a comparative analysis result are generated for download by the user equipment.
In a possible implementation, in step 102, retrieving keywords according to user requirements and extracting retrieval data includes:
providing a dialog box in the cross-platform data analysis system, so that a user can enter the content keywords to be retrieved as required and exclude redundant information; obtaining related content on an internet platform within a specific time period, wherein the related content belongs to the user account in a database of the cross-platform data analysis system;
and retrieving the keywords according to the user requirements, and extracting the retrieval data.
In a possible implementation, in step 103, performing feature mining and intelligent analysis on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result includes:
shortening the mapping to the paragraph level, adding a semantic distribution matching algorithm on top of literal word matching, and performing similarity analysis on the retrieval data in combination with causal reasoning, keyword extraction and summary extraction algorithms to obtain an analysis result.
In a possible implementation, in step 104, judging whether an infringement condition is met according to the analysis result, and if so, generating an infringement report and a comparative analysis result, includes:
comparing the retrieved articles suspected of manuscript washing or plagiarism with the target analysis content, and generating a sample of infringement litigation materials and a data analysis report when the infringement condition is met.
The embodiments of the application are based on a cross-platform data acquisition scheme and text matching technology and can be used for copyright protection. First, a cross-platform network data acquisition system is built to capture publicly available Internet content. Text retrieval and text analysis modules are embedded in the acquisition system, so that related data can be extracted by keyword retrieval according to user requirements. At the same time, existing text content matching methods are improved in a targeted way to suit the identification of similar articles in the new media environment, so that articles suspected of manuscript washing can be found and corresponding clues and evidence provided. Then, when the user starts a session and sends a request to the server, the system collects and analyzes information and returns to the user a list of suspected infringing content together with content samples. Finally, by means of modular legal document templates, a sample service for copyright litigation materials is provided to the user. By means of the big data acquisition and analysis system and the improved text matching technique, the method can retrieve and collect infringing content in the new media era and provides an effective solution to the problem of copyright protection for network content.
To make the cross-platform data analysis method provided in the embodiments of the application clearer, the embodiment of the cross-platform data analysis system shown in Fig. 2 is now described further.
The user's original content is input into the front-end interaction layer; the front-end interaction layer passes the processed original data to the data acquisition layer; the data acquisition layer sends the collected data to the data analysis layer; and the data analysis layer sends the analysis results to the data storage layer for storage. The front-end interaction layer comprises a terminal adaptation module, a user interaction module and a visualization module. The data acquisition layer comprises a data source analysis module, crawler middleware, download middleware and an alarm module. The data analysis layer comprises a text matching module, a content analysis module, a theme analysis module and an evidence generation module. The data storage layer comprises a Redis module, an HBase module and a MySQL module.
Cross-platform data acquisition module: this module collects publicly available network content, focusing on articles published or reposted by new media outlets and independent creators. The data is stored on a cloud server, waiting for a connection to be established with the client. The module is characterized by real-time operation, high concurrency and high robustness. It adopts distributed crawlers and a hierarchical storage framework to improve concurrency and response speed, uses middleware to improve the scalability and robustness of the system, and handles system faults in real time through the alarm module, thereby keeping the system running efficiently and normally.
User interaction module: after logging in to the system, the user enters keywords, ambiguous words and exclusion words, sets the time range of the content to be collected, and sends a query request for infringing content to the server. In response to the request, the system automatically collects content with similar keywords and topics using the established big data acquisition platform. The design of the interaction between user and system is characterized by reasonableness and usability. Specifically, a dialog box is provided in the system so that the user can enter the content keywords to be retrieved as required, exclude redundant information, and obtain related content on Internet platforms within a specific time period; this content belongs to the user account in the database system. All data can be downloaded or used for secondary retrieval. When a user needs to process several items of related content, separate schemes can be set up, each carrying out the above steps.
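The filtering described for the user interaction module can be sketched as follows. This is a minimal illustration, not the system's implementation: the names `RetrievalRequest`, `Article`, `matches` and `retrieve` are hypothetical, and the rule that an article must contain every keyword (rather than any keyword) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class RetrievalRequest:
    """Hypothetical shape of one user query; field names are illustrative."""
    keywords: List[str]
    exclusion_words: List[str] = field(default_factory=list)
    start: date = date.min
    end: date = date.max


@dataclass
class Article:
    title: str
    body: str
    published: date


def matches(request: RetrievalRequest, article: Article) -> bool:
    """True if the article is inside the time range, contains every keyword
    (an assumption) and contains none of the exclusion words."""
    if not (request.start <= article.published <= request.end):
        return False
    text = f"{article.title} {article.body}".lower()
    if any(word.lower() in text for word in request.exclusion_words):
        return False
    return all(word.lower() in text for word in request.keywords)


def retrieve(request: RetrievalRequest, articles: List[Article]) -> List[Article]:
    """Keep only the collected articles that satisfy the request."""
    return [article for article in articles if matches(request, article)]
```

Exclusion words are checked before keywords so that redundant results are dropped even when they match every keyword.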
Data analysis module: this module performs feature mining and intelligent analysis of data based on existing statistical learning methods, natural language processing techniques, machine learning methods, deep learning methods and the like. It is characterized by being scientific and intelligent. Specifically, the user can enter the title and content of a work, whether their own or someone else's, and the closest content is retrieved from the collected data samples. The system provides the user with a list of similar texts according to the configured similarity percentage; the user can choose to keep the Top 10 or Top 20 most similar items as references for suspected infringing content, and a list of similar content and a matching result report are generated for download.
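A minimal sketch of the "similarity percentage plus Top 10/20" selection described above. The similarity measure here is Python's `difflib.SequenceMatcher`, used only as a stand-in for whatever matcher the system actually applies; `top_similar`, `min_similarity` and `k` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the real matcher."""
    return SequenceMatcher(None, a, b).ratio()


def top_similar(target: str, corpus: Dict[str, str],
                min_similarity: float = 0.5, k: int = 10) -> List[Tuple[str, float]]:
    """Return up to k (doc_id, score) pairs whose score reaches the threshold,
    ordered from most to least similar."""
    scored = [(doc_id, similarity(target, text)) for doc_id, text in corpus.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]
```

Calling `top_similar(text, corpus, min_similarity=0.6, k=20)` would correspond to a user who sets a 60% similarity percentage and asks for the Top 20 list.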
Manuscript washing article identification module: this module targets the plagiarism found in the new media environment, including manuscript washing (reworded plagiarism) practices such as "incomplete keyword matching", "reordering of context sentences" and "restating content in other words", and judges the degree of plagiarism and infringement using an improved text matching method. Building on existing techniques such as causal reasoning, keyword matching and summary extraction, the module performs targeted trace detection and text matching against the common techniques used in manuscript washing, and uses a quantitative evaluation method to identify suspected manuscript washing articles with a large influence.
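As an illustration of why sentence reordering defeats naive whole-text comparison, the sketch below scores a candidate by matching each original sentence to its best counterpart regardless of position. This is not the matching method of the application, which is not specified at this level of detail; character trigram Jaccard similarity and the function names are assumptions chosen for the example.

```python
import re
from typing import List, Set


def sentences(text: str) -> List[str]:
    """Split on Western and Chinese sentence-ending punctuation."""
    parts = re.split(r"[.!?。！？]+", text)
    return [part.strip().lower() for part in parts if part.strip()]


def ngrams(sentence: str, n: int = 3) -> Set[str]:
    """Set of character n-grams after collapsing whitespace."""
    s = re.sub(r"\s+", " ", sentence)
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rewrite_score(original: str, candidate: str) -> float:
    """Average, over the original's sentences, of the best match found among
    the candidate's sentences. Sentence order does not affect the score."""
    orig = [ngrams(s) for s in sentences(original)]
    cand = [ngrams(s) for s in sentences(candidate)]
    if not orig or not cand:
        return 0.0
    return sum(max(jaccard(o, c) for c in cand) for o in orig) / len(orig)
```

A candidate that merely swaps the order of the original's sentences scores 1.0, whereas synonym substitution would need a semantic measure that this sketch does not include.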
Automatic infringement processing module: this module compares articles suspected of manuscript washing or plagiarism with the user's original articles, gives a comparative analysis result, and automatically generates a sample of infringement litigation materials for download by the requesting party. The system supports batch download of both data analysis reports and litigation documents. The module mainly provides text samples of copyright litigation materials that users can select and download in batches as required. This function mainly addresses the problem that content creation organizations and individuals whose work is heavily plagiarized must invest a great deal of time and effort in choosing litigation options and drafting documents.
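Template-based generation of document samples could look roughly like the sketch below, which uses Python's standard `string.Template`. The template wording, the field names and the example.com URLs in the usage note are all hypothetical; the actual legal document templates are not given in the source.

```python
from string import Template
from typing import Dict, List, Tuple

# Hypothetical template text; a real legal template would be far more detailed.
NOTICE_TEMPLATE = Template(
    "Notice of suspected copyright infringement\n"
    "Rights holder: $holder\n"
    "Original work: $original_title ($original_url)\n"
    "Suspected copy: $suspect_title ($suspect_url)\n"
    "Measured similarity: $similarity\n"
)


def render_notice(holder: str, original: Dict[str, str],
                  suspect: Dict[str, str], similarity: float) -> str:
    """Fill the template for one original/suspect pair."""
    return NOTICE_TEMPLATE.substitute(
        holder=holder,
        original_title=original["title"],
        original_url=original["url"],
        suspect_title=suspect["title"],
        suspect_url=suspect["url"],
        similarity=f"{similarity:.0%}",
    )


def render_batch(holder: str, original: Dict[str, str],
                 matches: List[Tuple[Dict[str, str], float]]) -> List[str]:
    """One notice per suspected article, ready for batch download."""
    return [render_notice(holder, original, suspect, score) for suspect, score in matches]
```

For example, `render_batch("Alice", {"title": "My essay", "url": "https://example.com/a"}, [({"title": "Copy", "url": "https://example.com/b"}, 0.87)])` yields a single notice that reports a similarity of 87%.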
The prior-art text matching techniques relevant to the embodiments of the application are described below:
Text matching is a core application in Natural Language Processing (NLP); many tasks in fields such as information retrieval, question-answering systems and machine translation can be abstracted as text matching problems. Traditional text matching techniques include VSM, TF-IDF, Jaccard, BoW, SimHash, BM25 and the like, most of which address similarity at the vocabulary level; the BM25 algorithm, for example, computes a text matching score from how well the document covers the query terms. Their limitation is that matching based on literal words has difficulty computing semantic meaning, so problems such as ambiguous words and limited knowledge are hard to solve.
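To make the vocabulary-level nature of these methods concrete, here is a minimal BM25 scorer. It is a sketch under stated assumptions: whitespace tokenization (Chinese text would need a word segmenter), the commonly used parameter values k1 = 1.5 and b = 0.75, and the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)).

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # purely literal: unseen words contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that expresses the same idea with different words scores zero here, which is exactly the limitation of literal matching noted above.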
Semantic analysis based on topic models maps sentences to equal-length vectors in a low-dimensional continuous space and computes similarity in the latent semantic space; typical algorithms of this kind include probabilistic topic models such as PLSA and LDA. Although this approach compensates for some shortcomings of traditional text matching in terms of computational efficiency, in practical applications it serves only as an effective supplement to literal matching.
With the application of neural networks in the NLP field, text matching methods based on word embeddings trained by neural networks further strengthen the semantic computability of word vector representations. The matching degree obtained by training on unlabeled data is, like that of a topic model, essentially based on co-occurrence information. However, this approach does not solve the asymmetry problem of matching either. Deep matching models perform the interaction computation at a matching layer, using methods such as dot product, cosine, Gaussian distribution, MLP and similarity matrices; classical models include DSSM, CDSSM, MV-LSTM, ARC-I, CNTN and MultiGranCNN.
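The idea of matching in an embedding space can be sketched with averaged word vectors and cosine similarity. The three-dimensional vectors below are hand-made toy values for illustration only, not trained embeddings; a real system would load vectors learned from a large corpus.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]

# Toy vectors invented for this example; they are NOT trained embeddings.
TOY_VECTORS: Dict[str, Vector] = {
    "copy": (0.9, 0.1, 0.0),
    "plagiarize": (0.8, 0.2, 0.1),
    "article": (0.1, 0.9, 0.0),
    "essay": (0.2, 0.8, 0.1),
    "weather": (0.0, 0.1, 0.9),
}


def sentence_vector(tokens: List[str]) -> Vector:
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return (0.0, 0.0, 0.0)
    return tuple(sum(component) / len(known) for component in zip(*known))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

With these toy values, "copy article" and "plagiarize essay" share no word yet obtain a cosine above 0.9, while "copy article" and "weather" stay below 0.2. Note that cosine similarity is symmetric, which illustrates why such representations alone do not address asymmetric matching.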
In addition, interaction models based on modeling semantic focus and context importance treat the matching signals between words as a grayscale image and then perform subsequent abstract modeling. In the interaction layer, the words of the two texts form an interaction matrix through an operation similar to attention, and the representation layer is responsible for abstracting the interaction matrix; classical models include DRMM, DeepRank, IR-Transformer, ESIM and ABCNN. This approach handles local information but cannot describe global matching information from the local matching information. Existing text matching technology is mainly applied to the similarity between user queries and advertisement pages, the extraction and topic clustering of document keywords, personalized news recommendation, CTR estimation for vertical news and the like. Unlike the semantic matching adopted by these methods, the method provided by the embodiments of the application shortens the mapping to the paragraph level, adds an algorithm design for semantic distribution matching on top of literal word matching, and achieves the more difficult task of "manuscript washing detection" with the help of methods such as causal reasoning, keyword extraction and summary extraction.
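The combination of literal matching with a distribution-level comparison at paragraph granularity can be illustrated as follows. The source does not specify the "semantic distribution matching" algorithm, so this sketch substitutes an assumed stand-in: one minus the Jensen-Shannon divergence between character-bigram frequency distributions, blended with word-level Jaccard similarity. The blank-line paragraph split, the equal weighting and all function names are likewise assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def paragraphs(text: str) -> List[str]:
    """Split a document into paragraphs on blank lines (an assumption)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def literal_jaccard(a: str, b: str) -> float:
    """Literal word overlap between two paragraphs."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def bigram_distribution(text: str) -> Dict[str, float]:
    """Relative frequencies of character bigrams, ignoring whitespace."""
    s = "".join(text.lower().split())
    counts = Counter(s[i:i + 2] for i in range(len(s) - 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def js_similarity(p: Dict[str, float], q: Dict[str, float]) -> float:
    """1 - Jensen-Shannon divergence (base 2), so identical distributions give 1."""
    if not p or not q:
        return 0.0
    divergence = 0.0
    for gram in set(p) | set(q):
        pg, qg = p.get(gram, 0.0), q.get(gram, 0.0)
        mean = (pg + qg) / 2
        if pg:
            divergence += 0.5 * pg * math.log2(pg / mean)
        if qg:
            divergence += 0.5 * qg * math.log2(qg / mean)
    return max(0.0, 1.0 - divergence)


def paragraph_score(a: str, b: str, weight: float = 0.5) -> float:
    """Blend literal and distribution similarity; weight is a free parameter."""
    dist = js_similarity(bigram_distribution(a), bigram_distribution(b))
    return weight * literal_jaccard(a, b) + (1 - weight) * dist


def best_paragraph_matches(target: str, candidate: str) -> List[Tuple[str, float]]:
    """For each target paragraph, the best score among candidate paragraphs."""
    cand = paragraphs(candidate)
    return [(p, max((paragraph_score(p, c) for c in cand), default=0.0))
            for p in paragraphs(target)]
```

Scoring per paragraph rather than per article means that a single lifted paragraph inside an otherwise unrelated article still produces one high score.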
With the growing demand for obtaining and classifying the massive information on the Internet, maximizing the value and benefit of data in a more professional way has opened up incremental space for the data market. At present, many enterprises engage in mass data collection; most rely on vertical search engine technology, and some apply a combination of several technologies.
Generally, internet data acquisition is accomplished mainly through the combined application of vertical search engine technologies such as web crawlers, word segmentation systems and task indexing systems. The basic path is: keyword input, URL splicing, URL extraction, simulated request, page crawl judgment, page parsing, data storage, database.
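The basic path above can be sketched with the Python standard library. The network call is injected as a `fetch` function so the example stays self-contained; the query parameter names `q` and `page`, the interpretation of "page crawl judgment" as an already-crawled check, and the reduction of page parsing to storing raw HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Set
from urllib.parse import urlencode, urljoin


def splice_search_url(base: str, keyword: str, page: int = 1) -> str:
    """URL splicing: build a search URL from a keyword (parameter names assumed)."""
    return f"{base}?{urlencode({'q': keyword, 'page': page})}"


class LinkExtractor(HTMLParser):
    """URL extraction: collect absolute link targets from <a href=...> tags."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.page_url, href))


def extract_urls(page_url: str, html: str) -> List[str]:
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.links


def crawl_once(search_url: str, fetch: Callable[[str], str],
               seen: Set[str], store: Dict[str, str]) -> Dict[str, str]:
    """Request the result page, skip URLs already crawled, store the rest."""
    for url in extract_urls(search_url, fetch(search_url)):
        if url in seen:  # page crawl judgment, read here as "already crawled"
            continue
        seen.add(url)
        store[url] = fetch(url)  # page parsing is reduced to keeping raw HTML
    return store
```

In production, `fetch` would wrap an HTTP client that sets request headers to simulate a browser, and `store` would be a database rather than a dictionary.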
The main problems to be solved in building a data acquisition system include diverse data sources, large data volume, fast data updates, duplicate data and high data quality requirements; the system is also required to be reliable and stable. Most currently popular data collection platforms abstract an architecture of input, output and an intermediate buffer, and use distributed network connections to provide users with scalable data collection.
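The input, intermediate buffer and output abstraction, together with removal of duplicate data, can be sketched in a single process with a bounded queue. Real platforms distribute these stages across machines; the SHA-256 content hash used for deduplication and the name `run_pipeline` are assumptions for the example.

```python
import hashlib
import queue
import threading
from typing import Iterable, List


def run_pipeline(records: Iterable[str], sink: List[str], buffer_size: int = 100) -> List[str]:
    """Move records from an input stage to an output stage through a bounded
    buffer, dropping records whose content has already been written."""
    buffer: "queue.Queue" = queue.Queue(maxsize=buffer_size)
    seen = set()
    done = object()  # sentinel that tells the output stage to stop

    def source() -> None:
        for record in records:
            buffer.put(record)  # blocks when the buffer is full (back-pressure)
        buffer.put(done)

    def output() -> None:
        while True:
            record = buffer.get()
            if record is done:
                break
            cleaned = record.strip()
            digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
            if digest not in seen:  # duplicate data is skipped
                seen.add(digest)
                sink.append(cleaned)

    threads = [threading.Thread(target=source), threading.Thread(target=output)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sink
```

The bounded buffer decouples a fast input from a slower output: when the output falls behind, `put` blocks instead of letting memory grow without limit.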
Reference to data acquisition systems includes, but is not limited to, the following:
(1) Apache Flume. Flume is an open-source data acquisition system under the Apache umbrella that is highly reliable, highly scalable, easy to manage, and supports customer extensions. Flume is built in Java, so it relies on the Java runtime environment. Flume was originally designed by Cloudera engineers for aggregating log data and was later developed to handle streaming data events. Flume is designed as a distributed pipeline architecture that can be viewed as a network of agents between the data source and the destination, and it supports data routing. Each agent consists of a Source, a Channel, and a Sink.
(2) Fluentd, an open-source framework. The framework is developed in C and Ruby and uses JSON to unify log data. Its pluggable architecture supports data sources and data outputs of various types and formats, and it also provides high reliability and good scalability. Fluentd is very similar to Flume in deployment, and its Input/Buffer/Output closely resembles Flume's Source/Channel/Sink. Although Fluentd resembles Flume in most respects, the main difference is that its Ruby-based implementation gives it a smaller footprint but introduces cross-platform problems: the Windows platform is not supported.
(3) Logstash. Logstash is the "L" in the well-known open-source data stack ELK (Elasticsearch, Logstash, Kibana). Logstash is developed in JRuby, so it depends on the JVM at runtime. A typical Logstash configuration includes settings for Input, Filter, and Output. In most cases the ELK components are used together as a stack, and Logstash is preferred when the data system uses Elasticsearch.
(4) Splunk Forwarder. Among commercial big data platform products, Splunk provides complete data acquisition, storage, analysis, and presentation. It is a distributed data platform that mainly comprises: the Search Head, which is responsible for searching and processing data and provides information extraction at search time; the Indexer, which is responsible for storing and indexing data; and the Forwarder, which is responsible for collecting, cleaning, and transforming data and sending it to the Indexer. Splunk has built-in support for Syslog, TCP/UDP, and Spooling, and users can acquire specific data by developing an Input or a Modular Input. The software repository provided by Splunk contains many mature data acquisition applications, such as those for AWS and databases (DBConnect), which make it convenient to bring data from the cloud or a database into Splunk's data platform for analysis. Its drawback is that if one Forwarder machine fails, data collection is interrupted, and the running collection task cannot fail over to other Forwarders.
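The input, intermediate buffer, and output abstraction shared by the systems listed above (Source/Channel/Sink in Flume, Input/Buffer/Output in Fluentd) can be sketched as a toy in-process agent. This sketch does not use or imitate the API of any of those products; the `Agent` class, its methods, and the `capacity` parameter are hypothetical names chosen for illustration.

```python
import queue


class Agent:
    """Toy agent: a source feeds a bounded channel, and a sink drains it."""

    def __init__(self, source, sink, capacity=100):
        self.source = source                           # iterable producing events
        self.channel = queue.Queue(maxsize=capacity)   # buffer between source and sink
        self.sink = sink                               # callable consuming one event

    def _drain(self):
        """Deliver every buffered event to the sink, in arrival order."""
        count = 0
        while not self.channel.empty():
            self.sink(self.channel.get())
            count += 1
        return count

    def run(self):
        """Move all events from the source to the sink through the channel."""
        delivered = 0
        for event in self.source:
            if self.channel.full():      # back-pressure: empty the buffer before adding more
                delivered += self._drain()
            self.channel.put(event)
        delivered += self._drain()
        return delivered
```

The buffer decouples the rate at which events are produced from the rate at which they are written out, which is why these platforms place it between input and output.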
The embodiments of the application build on a cross-platform data acquisition scheme and text matching technology and can be used for copyright protection. First, a cross-platform network data acquisition system is built to capture publicly available Internet content. A text retrieval module and a text analysis module are embedded in the acquisition system, which can extract relevant data by keyword retrieval according to user requirements. Meanwhile, existing text content matching methods are improved in a targeted manner so that they are suitable for identifying similar manuscripts in the new media environment, can find articles suspected of manuscript washing, and can provide corresponding clues and evidence. Then, when the user opens a session and sends a request to the server, the system collects and analyzes information and returns a list of suspected infringing content and content samples to the user. Finally, by means of modular legal document templates, the user is provided with sample copyright litigation materials. Through a big data acquisition and analysis system and an improved text matching technology, the invention can retrieve and collect infringing content in the new media era and provides an effective solution to the copyright protection problem for network content.
It can be seen that the embodiments of the present application serve copyright protection in the new media field. A copyright protection system built on the cross-platform data acquisition platform helps overcome problems in copyright protection such as platform barriers, large volume, and difficult tracking, and greatly reduces copyright costs for media producers and independent content creators. It features high concurrency, low latency, and simple, flexible operation. Extended for cross-platform Internet content acquisition, the architecture comprises two main parts, a data acquisition and analysis system and a content matching scheme, and removes the barrier between network content acquisition systems and copyright discrimination and protection systems. At the technical level of content matching, an independently developed text matching algorithm and a repetition index design are adopted; the algorithm differs from general literal matching rules and from rules centered on vocabulary proportion matching, and helps find text that has been "manuscript washed" or "recomposed".
To sum up, the embodiments of the present application provide a cross-platform data analysis method in which a cross-platform data analysis system, embedded with a text retrieval module and a text analysis module, receives target analysis content sent by user equipment; keywords are retrieved according to user requirements and retrieval data is extracted; feature mining and intelligent analysis are performed on the retrieval data based on statistical learning, natural language processing, machine learning, and deep learning methods to obtain an analysis result; and whether an infringement condition is met is judged according to the analysis result and, if so, an infringement report and a comparative analysis result are generated for download by the user equipment. Through big data acquisition and analysis and an improved text matching technology, the method can retrieve and collect infringing content in the new media era and provides an effective solution to the copyright protection problem for network content.
Based on the same technical concept, an embodiment of the present application further provides a cross-platform data analysis system, as shown in fig. 3, the system includes:
an information receiving module 301, configured to receive target analysis content sent by user equipment, where a text retrieval module and a text analysis module are embedded in the cross-platform data analysis system;
a retrieval module 302, configured to retrieve keywords according to user requirements and extract retrieval data;
an analysis module 303, configured to perform feature mining and intelligent analysis on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method, and a deep learning method to obtain an analysis result; and
a result judgment module 304, configured to judge whether an infringement condition is met according to the analysis result and, if so, generate an infringement report and a comparative analysis result for download by the user equipment.
Optionally, the retrieval module 302 is specifically configured to: provide a session box in the cross-platform data analysis system so that the user can enter the content keywords to be retrieved as required and exclude redundant information; obtain related content on Internet platforms within a specific time period, where the related content belongs to the user account in a database of the cross-platform data analysis system; and retrieve the keywords according to user requirements and extract the retrieved data.
Optionally, the analysis module 303 is specifically configured to: narrow the mapping granularity to paragraphs, add a semantic distribution matching algorithm on top of literal word matching, and perform similarity analysis on the retrieval data in combination with causal reasoning, keyword extraction, and summary extraction algorithms to obtain the analysis result.
Optionally, the result judgment module 304 is specifically configured to: compare the retrieved articles suspected of manuscript washing or plagiarism with the target analysis content, and generate a sample of infringement litigation material and a data analysis report when the infringement condition is met.
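The division of work among the retrieval, analysis, and result judgment modules can be sketched as a small Python pipeline. This is only an illustrative sketch: the `Record` type, the function names, the use of `difflib.SequenceMatcher` as a placeholder similarity measure, and the 0.6 threshold are all assumptions, since the application does not specify a concrete similarity function or infringement threshold here.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class Record:
    url: str
    published: date
    text: str


def retrieve(records, keywords, start, end):
    """Retrieval module: keep records inside the time window that mention every keyword."""
    return [r for r in records
            if start <= r.published <= end
            and all(k.lower() in r.text.lower() for k in keywords)]


def analyze(target_text, candidates):
    """Analysis module: score each candidate against the target.
    SequenceMatcher is a placeholder, not the matching algorithm of the application."""
    return [(r, SequenceMatcher(None, target_text, r.text).ratio()) for r in candidates]


def judge(scored, threshold=0.6):
    """Result judgment module: report rows for candidates at or above the assumed threshold."""
    hits = sorted((pair for pair in scored if pair[1] >= threshold),
                  key=lambda pair: pair[1], reverse=True)
    return [{"url": r.url, "published": r.published.isoformat(), "similarity": round(s, 3)}
            for r, s in hits]


def run_pipeline(target_text, records, keywords, start, end, threshold=0.6):
    """Chain the three modules for one piece of target analysis content."""
    return judge(analyze(target_text, retrieve(records, keywords, start, end)), threshold)
```

An empty result list means no candidate reached the threshold; in the described system, a non-empty list would feed the infringement report and litigation material sample.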
Based on the same technical concept, an embodiment of the present application further provides a device, comprising: a data acquisition device, a processor, and a memory; the data acquisition device is used for acquiring data; the memory is used for storing one or more program instructions; and the processor is configured to execute the one or more program instructions to perform any of the methods described herein.
Based on the same technical concept, the embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium contains one or more program instructions, and the one or more program instructions are used for executing any one of the methods.
In the present specification, each embodiment of the method is described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. Reference is made to the description of the method embodiments.
It is noted that while the operations of the methods of the present invention are depicted in the drawings in a particular order, this is not a requirement or suggestion that the operations must be performed in this particular order or that all of the illustrated operations must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Although the present application provides method steps as in embodiments or flowcharts, additional or fewer steps may be included based on conventional or non-inventive approaches. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an apparatus or client product in practice executes, it may execute sequentially or in parallel (e.g., in a parallel processor or multithreaded processing environment, or even in a distributed data processing environment) according to the embodiments or methods shown in the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded.
The units, devices, modules, etc. set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, in implementing the present application, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of a plurality of sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above-mentioned embodiments are further described in detail for the purpose of illustrating the invention, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of cross-platform data analysis, the method comprising:
a cross-platform data analysis system receives target analysis content sent by user equipment, and a text retrieval module and a text analysis module are embedded into the cross-platform data analysis system;
searching the keywords according to the user requirements, and extracting search data;
performing feature mining and intelligent analysis on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result;
and judging whether the infringement condition is met according to the analysis result, and if so, generating an infringement report and a comparative analysis result for download by the user equipment.
2. The method of claim 1, wherein the retrieving the keywords according to the user requirement and extracting the retrieved data comprises:
setting a session box in the cross-platform data analysis system, so that a user inputs content keywords to be retrieved according to requirements and eliminates redundant information;
obtaining related content on an internet platform in a specific time period, wherein the related content belongs to the user account in a database of the cross-platform data analysis system;
and searching the keywords according to the user requirements, and extracting the searched data.
3. The method of claim 1, wherein the performing feature mining and intelligent analysis on the search data based on statistical learning, natural language processing, machine learning, and deep learning to obtain analysis results comprises:
and shortening the mapping into paragraphs, adding a semantic distribution matching algorithm on the basis of vocabulary literal matching, and performing similarity analysis on the retrieval data by combining algorithms of causal reasoning, keyword extraction and summary extraction to obtain an analysis result.
4. The method of claim 1, wherein the judging whether the infringement condition is met according to the analysis result, and if so, generating an infringement report and a comparative analysis result, comprises:
and comparing the searched suspected manuscript washing and plagiarism articles with the target analysis content, and generating an infringement litigation material sample and a data analysis report when the infringement condition is reached.
5. A cross-platform data analysis system, the system comprising:
the information receiving module is used for receiving target analysis content sent by user equipment, and a text retrieval module and a text analysis module are embedded in the cross-platform data analysis system;
the retrieval module is used for retrieving the keywords according to the user requirements and extracting retrieval data;
the analysis module is used for carrying out feature mining and intelligent analysis on the retrieval data based on a statistical learning method, a natural language processing technology, a machine learning method and a deep learning method to obtain an analysis result;
and the result judgment module is used for judging whether the infringement condition is met according to the analysis result, and if so, generating an infringement report and a comparative analysis result for download by the user equipment.
6. The system of claim 5, wherein the retrieval module is specifically configured to:
setting a session box in the cross-platform data analysis system, so that a user inputs content keywords to be retrieved according to requirements and eliminates redundant information;
obtaining related content on an internet platform in a specific time period, wherein the related content belongs to the user account in a database of the cross-platform data analysis system;
and searching the keywords according to the user requirements, and extracting the searched data.
7. The system of claim 5, wherein the analysis module is specifically configured to:
and shortening the mapping into paragraphs, adding a semantic distribution matching algorithm on the basis of vocabulary literal matching, and performing similarity analysis on the retrieval data by combining algorithms of causal reasoning, keyword extraction and summary extraction to obtain an analysis result.
8. The system of claim 5, wherein the result determination module is specifically configured to:
and comparing the searched suspected manuscript washing and plagiarism articles with the target analysis content, and generating an infringement litigation material sample and a data analysis report when the infringement condition is reached.
9. A device, characterized in that the device comprises: a data acquisition device, a processor and a memory;
the data acquisition device is used for acquiring data; the memory is to store one or more program instructions; the processor, configured to execute one or more program instructions to perform the method of any of claims 1-4.
10. A computer-readable storage medium having one or more program instructions embodied therein for performing the method of any of claims 1-4.
CN202010746899.7A 2020-07-29 2020-07-29 Cross-platform data analysis method, system, equipment and readable storage medium Pending CN112000929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010746899.7A CN112000929A (en) 2020-07-29 2020-07-29 Cross-platform data analysis method, system, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010746899.7A CN112000929A (en) 2020-07-29 2020-07-29 Cross-platform data analysis method, system, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112000929A true CN112000929A (en) 2020-11-27

Family

ID=73462497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010746899.7A Pending CN112000929A (en) 2020-07-29 2020-07-29 Cross-platform data analysis method, system, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112000929A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180091546A1 (en) * 2016-09-29 2018-03-29 Camelot Uk Bidco Limited Browser Extension for Contemporaneous in-Browser Tagging and Harvesting of Internet Content
CN109635090A (en) * 2018-12-14 2019-04-16 安徽中船璞华科技有限公司 A kind of copyright method for tracing based on machine learning
CN110851761A (en) * 2020-01-15 2020-02-28 支付宝(杭州)信息技术有限公司 Infringement detection method, device and equipment based on block chain and storage medium
CN111159389A (en) * 2019-12-31 2020-05-15 重庆邮电大学 Keyword extraction method based on patent elements, terminal and readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950326A (en) * 2021-03-17 2021-06-11 南通大学 Artificial intelligence data analysis system supporting deep learning working principle
CN113190657A (en) * 2021-05-18 2021-07-30 中国银行股份有限公司 NLP data preprocessing method, jvm and spark end server
CN113190657B (en) * 2021-05-18 2024-02-27 中国银行股份有限公司 NLP data preprocessing method, jvm and spark end server
CN113343149A (en) * 2021-06-22 2021-09-03 深圳市网联安瑞网络科技有限公司 Agent-based mobile terminal social media propagation effect evaluation method, system and application

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
US9798818B2 (en) Analyzing concepts over time
Qin et al. DuerQuiz: A personalized question recommender system for intelligent job interview
US11521603B2 (en) Automatically generating conference minutes
CN112131449B (en) Method for realizing cultural resource cascade query interface based on ElasticSearch
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
US20150324350A1 (en) Identifying Content Relationship for Content Copied by a Content Identification Mechanism
CN104978332B (en) User-generated content label data generation method, device and correlation technique and device
CN112000929A (en) Cross-platform data analysis method, system, equipment and readable storage medium
US10083031B2 (en) Cognitive feature analytics
Sarne et al. Unsupervised topic extraction from privacy policies
CN109344298A (en) A kind of method and device converting unstructured data to structural data
US20120317125A1 (en) Method and apparatus for identifier retrieval
Chawla et al. Automatic bug labeling using semantic information from LSI
Shekhawat Sentiment classification of current public opinion on BREXIT: Naïve Bayes classifier model vs Python’s TextBlob approach
Baquero et al. Predicting the programming language: Extracting knowledge from stack overflow posts
Al-Msie'deen et al. Automatic documentation of [mined] feature implementations from source code elements and use-case diagrams with the REVPLINE approach
CN112035723A (en) Resource library determination method and device, storage medium and electronic device
Huang et al. Query expansion based on statistical learning from code changes
CN113742496B (en) Electric power knowledge learning system and method based on heterogeneous resource fusion
Maynard et al. Change management for metadata evolution
Pu et al. A vision-based approach for deep web form extraction
Zhang et al. An improved ontology-based web information extraction
KR102625347B1 (en) A method for extracting food menu nouns using parts of speech such as verbs and adjectives, a method for updating a food dictionary using the same, and a system for the same
Ho et al. Data warehouse designing for Vietnamese textual document-based plagiarism detection system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination