CN114742055A - Data processing method, data processing apparatus, electronic device, medium, and program product - Google Patents

Info

Publication number
CN114742055A
Authority
CN
China
Prior art keywords
information
event
subject
text
source heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210325701.7A
Other languages
Chinese (zh)
Inventor
刘雨亮
胡殿明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ganyi Intelligent Technology Co ltd
Original Assignee
Beijing Ganyi Intelligent Technology Co ltd
Application filed by Beijing Ganyi Intelligent Technology Co ltd
Priority to CN202210325701.7A
Publication of CN114742055A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention provides a data processing method, an apparatus, an electronic device, a medium and a program product, belonging to the technical field of communications. The method comprises the following steps: performing event detection and subject identification on multi-source heterogeneous data to obtain event information and subject information corresponding to the multi-source heterogeneous data; determining relationship information corresponding to the subject information based on the subject information and a preset standardized subject library, and performing attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of pieces of subject information, each associated with one or more groups of relationship information; and obtaining event quadruple information of the multi-source heterogeneous data based on the event information, the subject information, the attribute information and the relationship information.

Description

Data processing method, data processing device, electronic equipment, medium and program product
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a data processing method, an apparatus, an electronic device, a medium, and a program product.
Background
With the rapid development of information technology, daily life produces a large amount of multi-source heterogeneous data. Such data is stored in different data sources or information sources, comes in different formats such as web pages, texts/files, pictures/multimedia and databases, and is both varied in type and large in scale. Most of it cannot be used directly by a computer for analysis and judgment.
Therefore, how to analyze massive multi-source heterogeneous data computationally in an effective and convenient way has become an urgent problem to be solved in the industry.
Disclosure of Invention
The invention provides a data processing method, a data processing apparatus, an electronic device, a medium and a program product, which address the shortcoming in the prior art that massive heterogeneous data is inconvenient to compute on and analyze directly.
The invention provides a data processing method, which comprises the following steps:
performing event detection and subject identification on multi-source heterogeneous data to obtain event information and subject information corresponding to the multi-source heterogeneous data;
determining relationship information corresponding to the subject information based on the subject information and a preset standardized subject library, and performing attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of pieces of subject information, and each piece of subject information is associated with one or more groups of relationship information;
and obtaining event quadruple information of the multi-source heterogeneous data based on the event information, the subject information, the attribute information and the relationship information.
According to the data processing method provided by the invention, event detection and subject identification are carried out on multi-source heterogeneous data to obtain event information and subject information corresponding to the multi-source heterogeneous data, and the method specifically comprises the following steps:
under the condition that the multi-source heterogeneous data is unstructured, performing textualization processing on the multi-source heterogeneous data to obtain text information corresponding to the multi-source heterogeneous data;
inputting the text information into a preset text subject classification model to obtain a text subject type corresponding to the text information;
inputting the text information into a named entity recognition model corresponding to the text subject type, and outputting first subject information corresponding to the text information;
when there are multiple pieces of first subject information, determining the subject information based on text density information of each piece of first subject information, or, when there is a single piece of first subject information, taking that first subject information as the subject information;
and performing event detection on the text information to obtain event information corresponding to the multi-source heterogeneous data.
According to a data processing method provided by the present invention, the event detection on the text information to obtain event information corresponding to the multi-source heterogeneous data includes:
inputting the text information into a preset text event classification model and a preset text semantic model, and outputting first event classification information and first semantic event information corresponding to the text information;
carrying out event recognition processing on the sentence set corresponding to the text information to obtain a sentence event set, and carrying out event recognition processing on the paragraph set corresponding to the text information to obtain a paragraph event set;
merging events of the same type in the sentence event set and in the paragraph event set to obtain a merged target event set;
and determining event information corresponding to the multi-source heterogeneous data in the target event set based on the occurrence frequency of each event in the target event set.
According to a data processing method provided by the present invention, the determining, based on the subject information and a preset standardized subject library, relationship information corresponding to the subject information includes:
matching corresponding standard subject information in the preset standardized subject library based on the subject information;
and obtaining the relation information corresponding to the standard main body information to obtain the relation information corresponding to the main body information.
According to the data processing method provided by the invention, after the event four-tuple information of the multi-source heterogeneous data is obtained, the method further comprises the following steps:
acquiring event history data corresponding to the event information;
analyzing the event duration and the event heat of the event historical data to obtain event duration information and event heat information;
and obtaining event evaluation information corresponding to the event information based on the event duration information, the event heat information and the historical occurrence frequency of the event information.
According to the data processing method provided by the invention, after the event four-tuple information of the multi-source heterogeneous data is obtained, the method further comprises the following steps:
inputting first event information and second event information in each event four-tuple information into an event occurrence probability prediction model to obtain the prediction probability of the second event information after the first event information occurs;
wherein the first event information and the second event information have a temporal correlation.
The present invention also provides a data processing apparatus comprising:
an analysis module, configured to perform event detection and subject identification on multi-source heterogeneous data to obtain event information and subject information corresponding to the multi-source heterogeneous data;
a determining module, configured to determine relationship information corresponding to the subject information based on the subject information and a preset standardized subject library, and to perform attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of pieces of subject information, and each piece of subject information is associated with one or more groups of relationship information;
and a processing module, configured to obtain event quadruple information of the multi-source heterogeneous data based on the event information, the subject information, the attribute information and the relationship information.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements any of the data processing methods described above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a data processing method as described in any one of the above.
According to the data processing method, the data processing apparatus, the electronic device, the medium and the program product provided by the invention, the semantic content of the multi-source heterogeneous data is converted into events and attribute information is extracted from those events; subject identification on the multi-source heterogeneous data is combined with the preset standardized subject library, so that relationship information is obtained effectively. The result is data in quadruple form (event information, subject information, attribute information and relationship information) that a computer can compute on, identify and analyze, which effectively addresses the problem that multi-source heterogeneous data is often difficult for a computer to analyze directly.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of label extraction provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of an event system provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of a sliding time window provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 6 illustrates a physical structure diagram of an electronic device.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
Fig. 1 is a schematic flow chart of a data processing method provided in an embodiment of the present application, and as shown in fig. 1, the method includes:
Step 110, performing event detection and subject identification on multi-source heterogeneous data to obtain event information and subject information corresponding to the multi-source heterogeneous data;
Specifically, the multi-source heterogeneous data described in the embodiments of the present application may include structured data and unstructured data, where structured data refers to data that is stored in a relational database or can be directly analyzed and used by a computer, and unstructured data refers to all other data.
More specifically, the multi-source heterogeneous data described in the embodiments of the present application may include one or more of video data, voice data, picture data, web page data, and file data.
In the embodiments of the present application, performing event detection on the multi-source heterogeneous data allows the useful event information it contains to be extracted effectively.
The event information described in the embodiments of the present application may specifically refer to an event or a state change that occurs at a certain time point or within a certain time period, in a certain geographical area, and that consists of one or more actions in which one or more roles participate. Examples include complaint-related events, regulatory inquiry events and revenue decline events.
The subject information described in the embodiments of the present application may specifically refer to a participant in an event, which may be a natural person or a company or organization; for a company or organization, the subject information may be its full name, short name, alias or registered name, market board information, and the like.
Step 120, determining relationship information corresponding to the subject information based on the subject information and a preset standardized subject library, and performing attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of subject information, and each subject information is associated with one or more groups of relationship information;
Specifically, the preset standardized subject library described in the embodiments of the present application may include the full historical set of subject names, aliases, short names and full names, together with basic industrial and commercial registration information, market board information, industry classification information, and product classification information.
After the subject information is determined, it can be checked against the preset standardized subject library to determine more accurate subject information, which corrects the subject information, yields more specific relationship information, and enriches the data types.
The attribute information described in the embodiments of the present application may specifically refer to attributes such as time, location, sentiment, person, and product name. More specifically, event-specific information may be extracted as attribute information; for example, for a financial event, the numerical values involved may be extracted as attributes of the event.
Step 130, obtaining event four-tuple information of the multi-source heterogeneous data based on the event information, the subject information, the attribute information and the relationship information.
The event quadruple information described in the embodiments of the present application must include subject information and event information, while the corresponding attribute information and relationship information may be null.
In the embodiments of the present application, the semantic content of the multi-source heterogeneous data is converted into events and attribute information is extracted from those events; subject identification on the multi-source heterogeneous data is combined with the preset standardized subject library, so that relationship information is obtained effectively. The result is data in quadruple form (event information, subject information, attribute information and relationship information) that a computer can compute on, identify and analyze, which effectively addresses the problem that multi-source heterogeneous data is often difficult for a computer to analyze directly.
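As a minimal sketch of the quadruple described above, the following Python fragment models one record with mandatory event and subject fields and optional attribute and relationship fields. The class name `EventQuadruple`, the helper `build_quadruple`, the field names and the dictionary representation are all assumptions made for illustration; the source does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EventQuadruple:
    """Hypothetical container for one (event, subject, attributes, relations) record."""

    event: str    # mandatory, e.g. "revenue decline"
    subject: str  # mandatory, e.g. a company name
    attributes: Dict[str, str] = field(default_factory=dict)  # may be empty
    relations: Dict[str, str] = field(default_factory=dict)   # may be empty

    def __post_init__(self) -> None:
        # The description requires event and subject information to be present;
        # attribute and relationship information are allowed to be empty.
        if not self.event or not self.subject:
            raise ValueError("event and subject are mandatory")


def build_quadruple(
    event: str,
    subject: str,
    attributes: Optional[Dict[str, str]] = None,
    relations: Optional[Dict[str, str]] = None,
) -> EventQuadruple:
    """Assemble a quadruple, substituting empty mappings for missing parts."""
    return EventQuadruple(event, subject, dict(attributes or {}), dict(relations or {}))
```

A record with only an event and a subject is therefore valid, whereas a record missing either of them is rejected.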
Optionally, the event detection and the subject identification are performed on the multi-source heterogeneous data to obtain event information and subject information corresponding to the multi-source heterogeneous data, and the method specifically includes:
under the condition that the multi-source heterogeneous data is unstructured, performing textualization processing on the multi-source heterogeneous data to obtain text information corresponding to the multi-source heterogeneous data;
inputting the text information into a preset text subject classification model to obtain a text subject type corresponding to the text information;
inputting the text information into a named entity recognition model corresponding to the text subject type, and outputting first subject information corresponding to the text information;
when there are multiple pieces of first subject information, determining the subject information based on text density information of each piece of first subject information, or, when there is a single piece of first subject information, taking that first subject information as the subject information;
and performing event detection on the text information to obtain event information corresponding to the multi-source heterogeneous data.
Specifically, when the multi-source heterogeneous data is unstructured, it cannot be directly used and processed by a computer and must first be converted to text. The conversion may proceed as follows:
Video: deep learning techniques are used to tag the video content with basic content labels, such as financial information video or fund video, and the content labels serve as the processed content for the next step of analysis and operation.
Voice: an automatic speech recognition deep learning model transcribes the speech content into plain text for the next step of analysis and operation.
Picture: optical character recognition extracts the text content of the picture, and a picture classification model classifies the picture content and assigns classification labels, such as bill picture or transaction record picture, for the next step of analysis and operation.
Web page: an HTML text density recognition model extracts the text content from the HTML, which is stored as plain text for the next step of analysis and operation.
File: a corresponding adapter is selected according to the file format to extract the text content. Common file formats, including TXT, PDF and Word, are currently supported; a file extraction tool extracts the file content, which is stored as text for the next step of analysis and operation.
In the embodiments of the present application, after the text information corresponding to the multi-source heterogeneous data is obtained, it can be input into a preset text subject classification model to obtain the text subject type corresponding to the text information. The model judges the category and keywords of the content described by the text to decide which category the text belongs to, such as enterprise description information, industry information, macro/policy information, or securities (stocks, bonds, funds, financing, derivatives and the like) information. The preset text subject classification model can be obtained by training a preset neural network model on text training samples carrying text category labels.
After the text subject type is determined, the named entity recognition model corresponding to that subject type is selected, and the subject involved in the text information is extracted: an enterprise name from enterprise information, an industry name from industry information, a macro name from macro information, and a security product name from security product information.
If the text contains multiple subjects, each extracted subject is further assigned a relevance score through text density judgment and a text semantic density model, (subject, relevance) pairs are output, and the subject with the highest relevance is selected as the target subject of the text.
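The selection of a target subject from several candidates can be illustrated as follows. The source relies on text density judgment and a text semantic density model, neither of which is specified; the sketch below substitutes a plain mention density (the share of the text occupied by mentions of a candidate) as the relevance score, and the function names `rank_subjects` and `pick_target_subject` are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def rank_subjects(text: str, candidates: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (subject, relevance) pairs sorted by descending relevance."""
    total = max(len(text), 1)
    pairs = []
    for name in dict.fromkeys(candidates):  # deduplicate, keep first-seen order
        if not name:
            continue
        # Assumed stand-in for the density models: fraction of the text
        # covered by literal mentions of this candidate.
        relevance = text.count(name) * len(name) / total
        pairs.append((name, relevance))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def pick_target_subject(text: str, candidates: Iterable[str]) -> Optional[str]:
    """Select the candidate with the highest relevance, or None if there is none."""
    ranked = rank_subjects(text, candidates)
    return ranked[0][0] if ranked else None
```

A trained semantic model would replace the counting step, while the output shape, a list of (subject, relevance) pairs from which the maximum is chosen, stays as described in the text.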
In the embodiments of the present application, once the subject information is obtained, event detection is further performed on the text information to obtain the corresponding event information.
Optionally, for structured data, tag/index processing is computed on the data through a big data architecture, and derivative calculation is then performed on the changes in the tags/indexes to form corresponding event tags. Fig. 2 is a schematic diagram of label extraction according to an embodiment of the present application; as shown in Fig. 2, it includes:
Index/label processing: using financial engineering methods, a label with a specific meaning is defined, and the numerical value corresponding to that feature is taken as its index. For example, the "income" item is selected from the financial items as a feature, and the specific value of "income", 10 billion, is its index. The output of this part is the processed (label, index) pair.
Index/label derivative calculation: using all historical data of the (label, index) pair, a calculation basis and method are designed (such as year-over-year calculation, period-over-period calculation, periodic calculation and the like) to obtain the derivative calculation result, which is then output as a (derivative label, calculation result) pair, for example (business income decreased, -10%).
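The derivative calculation can be illustrated with a year-over-year comparison, one of the calculation methods mentioned above. The function names `year_over_year` and `derive_label`, the wording of the derived label, and the choice to emit no event tag when the value is unchanged are assumptions for this sketch.

```python
from typing import Optional, Tuple


def year_over_year(current: float, previous: float) -> Optional[float]:
    """Relative change against the same period of the previous year."""
    if previous == 0:
        return None  # the ratio is undefined when the base value is zero
    return (current - previous) / abs(previous)


def derive_label(label: str, current: float, previous: float) -> Optional[Tuple[str, float]]:
    """Turn a (label, index) history into a (derivative label, result) pair."""
    change = year_over_year(current, previous)
    if change is None or change == 0:
        return None  # assumed: no change means no event tag is produced
    direction = "increased" if change > 0 else "decreased"
    return (f"{label} {direction}", change)
```

For instance, an index that moves from 100 to 90 yields a derived label ending in "decreased" with a change of about -0.10, matching the -10% example in the text.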
In the embodiments of the present application, after the multi-source heterogeneous data is converted to text, the subject type of the text is analyzed, and the corresponding subject information can be effectively extracted from the text information through the corresponding named entity recognition model.
Optionally, the performing event detection on the text information to obtain event information corresponding to the multi-source heterogeneous data includes:
inputting the text information into a preset text event classification model and a preset text semantic model, and outputting first event classification information and first semantic event information corresponding to the text information;
carrying out event recognition processing on the sentence set corresponding to the text information to obtain a sentence event set, and carrying out event recognition processing on the paragraph set corresponding to the text information to obtain a paragraph event set;
merging events of the same type in the sentence event set and in the paragraph event set to obtain a merged target event set;
and determining event information corresponding to the multi-source heterogeneous data in the target event set based on the occurrence frequency of each event in the target event set.
Specifically, the preset text event classification model described in the embodiments of the present application is obtained by training on a set of text training samples labeled with text event classification labels, and the preset text semantic model is obtained by training on a set of text training samples carrying event semantic labels.
The whole text is classified first: the text information is input into the preset text event classification model to obtain the first event classification information event_cls, and into the preset text semantic model to obtain the first semantic event information event_nlp of the whole article.
The article of text information is then split into a paragraph set consisting of a plurality of paragraphs and into a sentence set consisting of a plurality of sentences, and an event recognition operation is performed on each element of the paragraph set and the sentence set to obtain a paragraph event set and a sentence event set.
The event recognition operation described in the embodiments of the present application may specifically be to perform event classification on each paragraph and each sentence, yielding the paragraph event set set_p (event1, event2 …) and the sentence event set set_s (event1, event2 …). Events of the same type are merged within set_s and within set_p respectively, and their frequencies are counted to form sets of non-repeating events with corresponding frequencies. Sorting in descending order of freq gives set_p_unique (event1-freq, event2-freq) and set_s_unique (event1-freq, event2-freq), where freq is the corresponding frequency. Let the lengths of set_p_unique and set_s_unique be n1 and n2 respectively; the top [n/2] events of each set (with n being n1 or n2) are taken for further analysis, giving set_p_top (event1, event2 …) and set_s_top (event1, event2 …). All events in event_cls, event_nlp, set_p_top and set_s_top are then passed to a voting algorithm to generate the final event information, and the remaining events are kept as a candidate event library.
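The merging, ranking and voting steps above can be sketched as follows. Two points are assumptions rather than statements of the source: top [n/2] is read as the floor of n/2, and the unspecified voting algorithm is replaced by a simple plurality vote in which each of event_cls, event_nlp and every retained paragraph or sentence event casts one ballot. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_half(events: Sequence[str]) -> List[str]:
    """Merge identical events, sort by descending frequency, keep the top n // 2."""
    ranked = Counter(events).most_common()  # [(event, freq), ...], highest first
    keep = len(ranked) // 2                 # assumed reading of top [n/2]
    return [event for event, _ in ranked[:keep]]


def vote_event(
    event_cls: str,
    event_nlp: str,
    paragraph_events: Sequence[str],
    sentence_events: Sequence[str],
) -> Tuple[str, List[str]]:
    """Return the winning event and the remaining events as candidates."""
    set_p_top = top_half(paragraph_events)
    set_s_top = top_half(sentence_events)
    # Plurality vote (an assumption): one ballot per source.
    ballots = Counter([event_cls, event_nlp, *set_p_top, *set_s_top])
    ranked = ballots.most_common()
    winner = ranked[0][0]
    candidates = [event for event, _ in ranked[1:]]
    return winner, candidates
```

Under the floor reading, a set with a single distinct event contributes no ballot; a ceiling reading would be an equally plausible interpretation of the text.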
In the embodiment of the application, event information corresponding to multi-source heterogeneous data is effectively extracted by performing event analysis on the text.
Optionally, the determining, based on the subject information and a preset standardized subject library, relationship information corresponding to the subject information includes:
matching corresponding standard subject information in the preset standardized subject library based on the subject information;
and obtaining the relation information corresponding to the standard main body information to obtain the relation information corresponding to the main body information.
Specifically, in the embodiments of the present application, the preset standardized subject library is searched for the subject information. If identical subject information is found directly in the library, it is used as the standard subject information and the corresponding relationship information is obtained. If it cannot be found directly, the subject information is compared for similarity against the subject names in the preset standardized subject library, the closest subject name is taken as the final standard subject information, and the corresponding relationship information is obtained.
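The exact-match lookup with a similarity fallback can be sketched with the standard-library `difflib` module. The library contents, the name `SUBJECT_LIBRARY`, the function `match_subject` and the use of `difflib` string similarity are assumptions for illustration; the source does not state which similarity measure is used.

```python
import difflib
from typing import Dict, Optional, Tuple

# Hypothetical library: standard subject name -> relationship information.
SUBJECT_LIBRARY: Dict[str, Dict[str, str]] = {
    "Example Technology Co Ltd": {"industry": "software", "market": "main board"},
    "Sample Pharma Group": {"industry": "pharmaceuticals", "market": "growth board"},
}


def match_subject(
    name: str,
    library: Dict[str, Dict[str, str]] = SUBJECT_LIBRARY,
    cutoff: float = 0.0,
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (standard subject name, relationship information) for a raw name."""
    if name in library:
        # Exact hit: the raw name is already a standard subject name.
        return name, library[name]
    # Fallback: take the most similar name in the library. With cutoff=0.0 the
    # closest name is always accepted, as in the text; a higher cutoff (an
    # assumption) would reject poor matches instead.
    closest = difflib.get_close_matches(name, list(library), n=1, cutoff=cutoff)
    if not closest:
        return None
    return closest[0], library[closest[0]]
```

In this sketch a short name such as "Example Technology" resolves to the full registered name, which also corrects the extracted subject information.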
The relationship information described in the embodiments of the present application may specifically include the industrial and commercial, market, industry, and product information held in the subject library.
In the embodiments of the present application, the preset standardized subject library allows the subject information to be further corrected and the corresponding relationship information to be obtained.
In some embodiments, after the event quadruple information is obtained, an event system may be constructed according to the different event subject types; in view of the similarity and correlation in meaning among events, different events are grouped and merged hierarchically. Fig. 3 is a schematic diagram of an event system provided by an embodiment of the present application, which includes:
Primary events use the subject category as the division criterion and describe the different subject types of events, including enterprise events, industry events, macro/policy events and security product events.
Secondary events use the dimension in which the subject's event occurs as the division criterion and describe events occurring in different data ranges under the same subject category; for example, enterprise events can be further divided into enterprise financial events, enterprise legal events, enterprise operation events, enterprise risk events and the like.
Tertiary events are the original event quadruples, obtained by processing different data sources through the technical means described above.
Optionally, after obtaining the event quadruple information of the multi-source heterogeneous data, the method further includes:
acquiring event history data corresponding to the event information;
analyzing the event duration and the event heat of the event historical data to obtain event duration information and event heat information;
and obtaining event evaluation information corresponding to the event information based on the event duration information, the event heat information and the historical occurrence frequency of the event information.
Specifically, after the event quadruples are produced, they are further analyzed on the basis of the event system.
Because an event persists after it occurs, data related to the event continues to be generated for a period of time afterwards. The volume and timing of this data are counted on a daily basis to obtain the duration and the daily event heat of every historical occurrence of the event.
Statistical analysis is then performed on all durations and heat values, with the mean, maximum, and minimum taken as the basis for analysis. The number of times the event has occurred historically serves as a further dimension of analysis. This yields the number of occurrences of the event and the historical mean, maximum, and minimum of its duration and heat.
An event whose number of occurrences is smaller than a first preset threshold is regarded as a rare event, and the maximum value of its duration/heat is taken as the influence degree and influence range of the event. The ratio of the number of occurrences of the event to the total number of event occurrences is used as the influence probability.
An event whose number of occurrences exceeds the first preset threshold but is smaller than a second preset threshold is regarded as a common event, and the average value of its duration/heat is taken as the influence degree and influence range of the event. The ratio of the number of occurrences of the event to the total number of event occurrences is used as the influence probability.
An event whose number of occurrences exceeds the second preset threshold is regarded as a frequent event, and the minimum value of its duration/heat is taken as the influence degree and influence range of the event. The ratio of the number of occurrences of the event to the total number of event occurrences is used as the influence probability.
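The rare/common/frequent rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `evaluate_event`, its parameters, and the returned dictionary are hypothetical, the handling of counts exactly equal to a threshold is an assumption (the text does not specify it), and the aggregated duration and heat are returned separately because the text does not say which one maps to influence degree and which to influence range.

```python
from statistics import mean


def evaluate_event(durations, heats, total_occurrences,
                   first_threshold, second_threshold):
    """Classify an event by historical frequency and aggregate duration/heat.

    durations and heats hold one value per historical occurrence.
    """
    count = len(durations)
    if count == 0 or len(heats) != count:
        raise ValueError("need one duration and one heat per occurrence")
    if count < first_threshold:
        category, aggregate = "rare", max
    elif count < second_threshold:
        # Assumption: a count equal to the first threshold counts as common.
        category, aggregate = "common", mean
    else:
        # Assumption: a count equal to the second threshold counts as frequent.
        category, aggregate = "frequent", min
    return {
        "category": category,
        "duration": aggregate(durations),
        "heat": aggregate(heats),
        "probability": count / total_occurrences,
    }
```

For instance, an event seen twice out of 100 total occurrences with a first threshold of 5 would be treated as rare, so the maxima of its durations and heat values are reported.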
Optionally, the above provides a quantitative description of the influence degree, influence range, and influence probability of an event. For enterprise events, especially events related to listed enterprises, a further quantitative backtest of the event can be performed using stock prices, including:
Taking all stock prices within 30 days before and after each historical occurrence of the event as the measurement data set.
Counting the change in stock price after the event occurs, and recording the maximum, mean, 7-day, and 30-day values of the rise/fall.
Counting the rise/fall cases and calculating their proportions to obtain the influence probability of the event. For example, when a given event occurs, there is an 80% probability that the stock price drops by more than 5%.
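The stock-price backtest above can be illustrated with a short Python sketch. It assumes each historical occurrence is given as a pair of the event-day price and the list of daily prices that follow it. The function names, the input layout, and the default 5% threshold are assumptions chosen to mirror the example in the text.

```python
def price_changes(event_price, later_prices):
    """Relative changes of the prices after an event versus the event-day price."""
    changes = [(p - event_price) / event_price for p in later_prices]
    return {
        "max_rise": max(changes),   # largest relative change (may be negative)
        "max_fall": min(changes),   # smallest relative change (may be positive)
        "mean": sum(changes) / len(changes),
        "day7": changes[6] if len(changes) >= 7 else None,
        "day30": changes[29] if len(changes) >= 30 else None,
    }


def drop_probability(windows, threshold=0.05):
    """Share of historical occurrences whose largest fall exceeds the threshold."""
    hits = sum(
        1 for event_price, later in windows
        if price_changes(event_price, later)["max_fall"] < -threshold
    )
    return hits / len(windows)
```

With two historical occurrences, one falling 6% and one falling 1%, `drop_probability` reports that half of the occurrences dropped by more than 5%.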
In the embodiment of the present application, the event history data is analyzed in terms of event duration, event heat, and the like, so that events can be analyzed qualitatively in an effective manner.
As an optional embodiment, quadruple analysis is performed on events from different sources, and whether they are the same event is determined from the analysis result. If they are the same event, they are merged. If they are not, it is determined whether they are similar events with the same subject; if so, backtracking analysis is performed to determine whether the events are temporally related. Finally, qualitative or quantitative measurement is performed on the events to determine their influence range and degree.
As an optional embodiment, the event is taken as the right boundary of a time sliding window, and all event quadruples of the same subject within the window are taken for tracking analysis. A time series algorithm is combined to judge the correlation and relevance of the events, thereby implementing event backtracking analysis. Fig. 4 is a schematic diagram of the time sliding window provided in the embodiment of the present application. As shown in Fig. 4, the specific process includes:
Setting an event of a specific type as event N with occurrence time now, and taking the event sequence within the sliding time window looking backward, recorded as S = (t2, …, tN-1).
Taking subjects of the same type (for example, enterprise subjects) and recording all historical sequences in which this event occurred as SEQ = (S1, S2, …), with the length of each sequence recorded as SEQ_L = (L1, L2, …), where Ln = len(Sn).
Taking SEQ as the sample, solving for the maximum subsequence set of the event sequences in SEQ, and recording it as Tqmax.
With Tqmax as the sequence for predicting tN, the correlation is calculated as follows:
If Tqmax is empty or has length 1, the event cannot be predicted, and the correlation is recorded as 0.
If the length of Tqmax is greater than 1, the correlation of event N is len(Tqmax), that is, the length of the maximum subsequence of events related to the event.
The relevance is further calculated with Tqmax as the prediction sequence:
If Tqmax is empty or has length 1, the event cannot be predicted, and the relevance is recorded as 0.
Otherwise, a time series model is used to predict the relevance between Tqmax and the set S, and the value output by the model is recorded as the relevance.
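The backtracking steps above can be sketched in Python under one possible reading of the text. The sketch assumes that the "maximum subsequence" is the longest common subsequence (LCS) of the historical sequences, and it folds a pairwise LCS across the sequences, which is a heuristic rather than an exact multi-sequence LCS. All function names and the `(timestamp, subject, event_type)` input layout are hypothetical, and the time series model used for the relevance value is not specified in the text and is not modeled here.

```python
def window_sequence(events, subject, now, window):
    """Event types of one subject inside [now - window, now), oldest first.

    events is an iterable of (timestamp, subject, event_type) tuples.
    """
    hits = [(t, e) for t, s, e in events
            if s == subject and now - window <= t < now]
    return [e for _, e in sorted(hits)]


def lcs(a, b):
    """Longest common subsequence of two sequences (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def max_common_subsequence(seqs):
    """Heuristic Tqmax: fold the pairwise LCS over all historical sequences."""
    if not seqs:
        return []
    common = list(seqs[0])
    for s in seqs[1:]:
        common = lcs(common, list(s))
    return common


def correlation(seqs):
    """len(Tqmax) when Tqmax has more than one element, otherwise 0."""
    tqmax = max_common_subsequence(seqs)
    return len(tqmax) if len(tqmax) > 1 else 0
```

Folding pairwise keeps the sketch short; an exact common subsequence over many sequences is considerably more expensive to compute, so a production system would need to choose its own trade-off.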
In the embodiment of the present application, event quadruples are merged and classified in a systematic manner, and analysis schemes such as further qualitative/quantitative analysis, backtracking analysis, and cross-fusion analysis of events are provided on the basis of the event system. Massive multi-source heterogeneous data can thus be analyzed jointly through the event system, which overcomes the shortcomings of earlier single-point data analysis and the lack of systematization of a single data dimension.
Optionally, after obtaining the event quadruple information of the multi-source heterogeneous data, the method further includes:
inputting first event information and second event information in each event four-tuple information into an event occurrence probability prediction model to obtain the prediction probability of the second event information after the first event information occurs;
wherein the first event information and the second event information have a temporal correlation.
Specifically, in the embodiment of the present application, inference analysis is performed on events that have not yet occurred through the event occurrence probability prediction model, so that their probability of occurrence can be judged and early warnings issued.
Because events are often associated and continuous, specific preamble events often occur before a specific event. Assuming that the event to be predicted is Tn+1, a sliding time window is set and the set of preamble event sequences that historically preceded the event is collected as ((T2 … Tn), Tn+1), where Tn may be of multiple event types, and all such Tn events are recorded as the sequence SET_Tn = (T1, T2, …).
Taking SET_Tn as the seed sequence, a set of event sequences ((T1 … Tn-1), Tx) in which Tn occurred historically but Tn+1 did not is randomly selected, where Tx is an event in SET_Tn.
The data sets obtained in the two preceding steps (referred to as a and b) are mixed into training sets at ratios of 1:1, 1:10, and 1:100 respectively. A machine learning model is then selected to calculate the probability that Tn+1 occurs after Tn, and this probability is recorded as the event inference analysis probability.
Under different mixing ratios (that is, under different noise levels), different probability values are obtained and recorded as the occurrence probability of the Tn+1 event. Through model training, the probability that event N+1 occurs given that the event sequence (event 2 to event N) has occurred can finally be determined.
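The mixing of positive and negative sequences at different ratios can be sketched as follows. This is an illustration under stated assumptions: the text does not name the machine learning model, so a simple empirical frequency estimate stands in for it here, and the function names, the fixed random seed, and the tuple-based sequence layout are all hypothetical.

```python
import random


def mix_training_set(positives, negatives, ratio, seed=0):
    """Combine positives with sampled negatives at a 1:ratio mix.

    positives: preamble sequences that were followed by the target event.
    negatives: preamble sequences that were not followed by it.
    """
    rng = random.Random(seed)
    k = min(len(negatives), len(positives) * ratio)
    sampled = rng.sample(negatives, k)
    return [(seq, 1) for seq in positives] + [(seq, 0) for seq in sampled]


def follow_probability(training_set, last_event):
    """Stand-in model: share of samples ending in last_event that are positive."""
    labels = [label for seq, label in training_set
              if seq and seq[-1] == last_event]
    if not labels:
        return None
    return sum(labels) / len(labels)
```

Running the estimate at 1:1, 1:10, and 1:100 shows how the reported probability shrinks as more negative samples are mixed in, which is the noise effect the text refers to.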
In the embodiment of the present application, the event occurrence probability prediction model makes it possible to predict, with a certain probability, events that have not yet occurred, which improves the ability to respond to and handle future events and brings potential benefits to financial risk control, production, and the like.
The data processing device provided by the invention is described below, and the data processing device described below and the data processing method described above can be referred to correspondingly.
Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, as shown in fig. 5, including: an analysis module 510, a determination module 520, and a processing module 530;
the analysis module 510 is configured to perform event detection and subject identification on the multi-source heterogeneous data to obtain event information and subject information corresponding to the multi-source heterogeneous data;
the determining module 520 is configured to determine, based on the subject information and a preset standardized subject library, relationship information corresponding to the subject information, and perform attribute extraction on the event information to obtain attribute information corresponding to the event information, where the preset standardized subject library includes a plurality of subject information, and each of the subject information is associated with one or more sets of relationship information;
the processing module 530 is configured to obtain event four-tuple information of the multi-source heterogeneous data based on the event information, the subject information, the attribute information, and the relationship information.
According to the embodiment of the present application, the semantics in the multi-source heterogeneous data are converted into events, attribute information is extracted from the resulting event data, and the subjects of the multi-source heterogeneous data are identified and combined with the preset standardized subject library so that relationship information is obtained effectively. The result is quadruple data consisting of event information, subject information, attribute information, and relationship information that can be computed, recognized, and analyzed, which effectively solves the problem that multi-source heterogeneous data is difficult for a computer to analyze directly.
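The final assembly step performed by the determining and processing modules can be sketched in Python. The sketch assumes that event detection, subject identification, and attribute extraction have already produced their outputs upstream, and it models the preset standardized subject library as a plain dictionary. The `EventQuadruple` type, the function names, and the sample library entry are hypothetical.

```python
from typing import NamedTuple


class EventQuadruple(NamedTuple):
    event: str
    subject: str
    attributes: dict
    relations: list


def determine_relations(subject, subject_library):
    """Look up the relation groups associated with a standardized subject."""
    return subject_library.get(subject, [])


def build_quadruple(event, subject, attributes, subject_library):
    """Combine event, subject, attribute, and relation information."""
    relations = determine_relations(subject, subject_library)
    return EventQuadruple(event, subject, attributes, relations)


# Hypothetical library entry, for illustration only.
library = {"Acme Corp": [("parent", "Acme Holdings")]}
quad = build_quadruple("lawsuit", "Acme Corp", {"amount": "1M"}, library)
```

A subject that is missing from the library simply yields an empty relation list in this sketch; how the actual system handles unmatched subjects is not stated in the text.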
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a processor (processor)610, a communication Interface (Communications Interface)620, a memory (memory)630 and a communication bus 640, wherein the processor 610, the communication Interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may call logical instructions in the memory 630 to perform a data processing method comprising: event detection and main body identification are carried out on multi-source heterogeneous data, and event information and main body information corresponding to the multi-source heterogeneous data are obtained; determining relation information corresponding to the subject information based on the subject information and a preset standardized subject library, and performing attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of subject information, and each subject information is associated with one or more groups of relation information; and obtaining event four-tuple information of the multi-source heterogeneous data based on the event information, the main body information, the attribute information and the relation information.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program being capable of executing, when executed by a processor, the data processing method provided by the above methods, the method including: event detection and main body identification are carried out on multi-source heterogeneous data, and event information and main body information corresponding to the multi-source heterogeneous data are obtained; determining relation information corresponding to the subject information based on the subject information and a preset standardized subject library, and performing attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of subject information, and each subject information is associated with one or more groups of relation information; and obtaining event four-tuple information of the multi-source heterogeneous data based on the event information, the main body information, the attribute information and the relation information.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the data processing method provided by the above methods, the method comprising: event detection and main body identification are carried out on multi-source heterogeneous data, and event information and main body information corresponding to the multi-source heterogeneous data are obtained; determining relation information corresponding to the subject information based on the subject information and a preset standardized subject library, and performing attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of subject information, and each subject information is associated with one or more groups of relation information; and obtaining event four-tuple information of the multi-source heterogeneous data based on the event information, the main body information, the attribute information and the relation information.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A data processing method, comprising:
event detection and main body identification are carried out on multi-source heterogeneous data, and event information and main body information corresponding to the multi-source heterogeneous data are obtained;
determining relation information corresponding to the subject information based on the subject information and a preset standardized subject library, and performing attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of subject information, and each subject information is associated with one or more groups of relation information;
and obtaining event four-tuple information of the multi-source heterogeneous data based on the event information, the main body information, the attribute information and the relation information.
2. The data processing method according to claim 1, wherein the performing event detection and subject identification on the multi-source heterogeneous data to obtain event information and subject information corresponding to the multi-source heterogeneous data specifically comprises:
under the condition that the multi-source heterogeneous data is unstructured, performing textualization processing on the multi-source heterogeneous data to obtain text information corresponding to the multi-source heterogeneous data;
inputting the text information into a preset text main body classification model to obtain a text main body type corresponding to the text information;
inputting the text information into a named entity recognition model corresponding to the text main body type, and outputting first main body information corresponding to the text information;
determining the subject information based on text density information of each of the first subject information when the first subject information is plural, or regarding the first subject information as the subject information when the first subject information is one;
and performing event detection on the text information to obtain event information corresponding to the multi-source heterogeneous data.
3. The data processing method of claim 2, wherein the performing event detection on the text information to obtain event information corresponding to the multi-source heterogeneous data comprises:
respectively inputting the text information into a preset text event classification model and a preset text semantic model, and outputting first event classification information and first semantic event information corresponding to the text information;
carrying out event recognition processing on the sentence set corresponding to the text information to obtain a sentence event set, and carrying out event recognition processing on the paragraph set corresponding to the text information to obtain a paragraph event set;
merging events of the same type in the sentence event set and the paragraph event set to obtain a merged target event set;
and determining event information corresponding to the multi-source heterogeneous data in the target event set based on the occurrence frequency of each event in the target event set.
4. The data processing method according to claim 1, wherein the determining, based on the subject information and a preset standardized subject library, relationship information corresponding to the subject information includes:
matching corresponding standard subject information in the preset standardized subject library based on the subject information;
and obtaining the relation information corresponding to the standard main body information to obtain the relation information corresponding to the main body information.
5. The data processing method of any one of claims 1 to 4, further comprising, after obtaining event quadruple information of the multi-source heterogeneous data:
acquiring event history data corresponding to the event information;
analyzing the event duration and the event heat of the event historical data to obtain event duration information and event heat information;
and obtaining event evaluation information corresponding to the event information based on the event duration information, the event heat information and the historical occurrence frequency of the event information.
6. The data processing method of any one of claims 1 to 4, further comprising, after obtaining event quadruple information of the multi-source heterogeneous data:
inputting first event information and second event information in each event four-tuple information into an event occurrence probability prediction model to obtain the prediction probability of the second event information after the first event information occurs;
wherein the first event information and the second event information have a temporal correlation.
7. A data processing apparatus, comprising:
the analysis module is used for carrying out event detection and main body identification on the multi-source heterogeneous data to obtain event information and main body information corresponding to the multi-source heterogeneous data;
the determining module is used for determining relationship information corresponding to the subject information based on the subject information and a preset standardized subject library, and performing attribute extraction on the event information to obtain attribute information corresponding to the event information, wherein the preset standardized subject library comprises a plurality of subject information, and each subject information is associated with one or more groups of relationship information;
and the processing module is used for obtaining event four-tuple information of the multi-source heterogeneous data based on the event information, the main body information, the attribute information and the relation information.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the data processing method according to any of claims 1 to 6 when executing the program.
9. A non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the data processing method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the data processing method of any one of claims 1 to 6.
CN202210325701.7A 2022-03-29 2022-03-29 Data processing method, data processing apparatus, electronic device, medium, and program product Pending CN114742055A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210325701.7A CN114742055A (en) 2022-03-29 2022-03-29 Data processing method, data processing apparatus, electronic device, medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210325701.7A CN114742055A (en) 2022-03-29 2022-03-29 Data processing method, data processing apparatus, electronic device, medium, and program product

Publications (1)

Publication Number Publication Date
CN114742055A true CN114742055A (en) 2022-07-12

Family

ID=82278414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210325701.7A Pending CN114742055A (en) 2022-03-29 2022-03-29 Data processing method, data processing apparatus, electronic device, medium, and program product

Country Status (1)

Country Link
CN (1) CN114742055A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109885698A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of knowledge mapping construction method and device, electronic equipment
CN110489395A (en) * 2019-07-27 2019-11-22 西南电子技术研究所(中国电子科技集团公司第十研究所) Automatically the method for multi-source heterogeneous data knowledge is obtained
CN110598005A (en) * 2019-09-06 2019-12-20 中科院合肥技术创新工程院 Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN110717034A (en) * 2018-06-26 2020-01-21 杭州海康威视数字技术股份有限公司 Ontology construction method and device
CN110825882A (en) * 2019-10-09 2020-02-21 西安交通大学 Knowledge graph-based information system management method
CN112639781A (en) * 2018-07-09 2021-04-09 西门子股份公司 Knowledge graph for real-time industrial control system security event monitoring and management
CN112765485A (en) * 2021-01-18 2021-05-07 深圳市网联安瑞网络科技有限公司 Network social event prediction method, system, terminal, computer device and medium
CN113505233A (en) * 2021-06-07 2021-10-15 中国科学院地理科学与资源研究所 Extraction method of ecological civilized geographic knowledge based on open domain
US20220019742A1 (en) * 2020-07-20 2022-01-20 International Business Machines Corporation Situational awareness by fusing multi-modal data with semantic model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FATEMEH SHIRI 等: ""Toward the Automated Construction of Probabilistic Knowledge Graphs for the Maritime Domain"", 2021 IEEE 24TH INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION), 2 December 2021 (2021-12-02) *
林瑀;陈日成;金涛;: "面向复杂信息***的多源异构数据融合技术", 中国测试, no. 07, 31 July 2020 (2020-07-31) *
邵琦 等: ""基于语义的突发公共卫生事件网络舆情主题发现研究"", 《数据分析与知识发现》, 30 September 2020 (2020-09-30) *
郑忠斌;宋海涛;费海平;丁镇;: "一种基于时序关联的工业大数据事件融合方法设计", 自动化技术与应用, no. 12, 25 December 2019 (2019-12-25) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination