WO2024131091A1 - Information association method and apparatus, device, and storage medium

Info

Publication number
WO2024131091A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
stock
entity
piece
individual
Prior art date
Application number
PCT/CN2023/112709
Other languages
French (fr)
Chinese (zh)
Inventor
王嘉楠
潘康
Original Assignee
深圳市富途网络科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市富途网络科技有限公司
Publication of WO2024131091A1

Definitions

  • Individual stocks are a type of securities with no repayment period. They can be divided into three types according to the stock holders: state-owned stocks, corporate stocks, and individual stocks. Individual stocks are invested by individuals and can be freely listed and circulated.
  • Information-related stock services are crucial. From the perspective of operational distribution, after a piece of financial information is stored in the database, when distributing the information, it is necessary to consider which stocks the information is most relevant to, and put the information under the information list of these stocks, to provide the most timely and relevant information to users who pay attention to these stocks. From the perspective of user experience, users need to quickly locate which stocks the information is related to while reading the information, and can directly access the individual stock trading quotation page from the information page, thereby helping users make more accurate investment decisions and place orders more quickly.
  • The information side associates stocks by matching full stock names. This method produces a match only when the stock name appears in full, so distribution efficiency is low and incorrect associations are common.
  • the embodiments of the present application provide an information association method, apparatus, device and storage medium, which can improve the accuracy of association between information and individual stocks and improve the efficiency of information distribution.
  • an embodiment of the present application provides an information association method, the method comprising:
  • the associated stocks corresponding to each piece of information in the information set are determined, and the associated stocks represent stocks related to the corresponding information.
  • an information association device, comprising:
  • an extraction unit, used to extract entity information of each piece of information in the information set, wherein the entity information includes at least one entity;
  • a first determining unit, used to determine global statistical relationship information based on the stock set, the information set and the entity information of each piece of information in the information set;
  • a second determining unit, used to determine the associated stocks corresponding to each piece of information in the information set according to the global statistical relationship information, wherein the associated stocks represent stocks related to the corresponding information.
  • an embodiment of the present application provides a computer device, which includes a processor and a memory, wherein a computer program is stored in the memory, and the processor is used to execute the information association method described in any of the above embodiments by calling the computer program stored in the memory.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is suitable for being loaded by a processor to execute the information association method described in any of the above embodiments.
  • The embodiment of the present application extracts entity information of each piece of information in the information set, where the entity information includes at least one entity; determines global statistical relationship information based on the stock set, the information set, and the entity information of each piece of information in the information set; and then, based on the global statistical relationship information, determines the associated stocks corresponding to each piece of information in the information set, where the associated stocks represent stocks related to the corresponding information. This can improve the accuracy of the association between information and stocks and improve the efficiency of information distribution.
  • FIG. 1 is a flow chart of an information association method provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the structure of an information association device provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the structure of a server provided in an embodiment of the present application.
  • the embodiments of the present application provide an information association method, apparatus, terminal device, and storage medium.
  • the information association method of the embodiments of the present application can be executed by a computer device, wherein the computer device can be a terminal or a server.
  • The terminal can be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart TV, a smart speaker, a wearable smart device, a smart in-vehicle terminal, or another device.
  • the terminal may also include a client, which may be a financial client, a browser client or an instant messaging client, etc.
  • the server may be an independent physical server, or a server cluster or distributed system consisting of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content distribution network services, and big data and artificial intelligence platforms, but is not limited to these.
  • With full-name matching, the information can be linked to the stock only when the stock name appears in full. For example, when "Company A Holdings" appears in the information, it can be linked to the stock 00700.HK, but when only "Company A" appears, it cannot be linked to the stock 00700.HK.
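  • The limitation described above can be illustrated with a minimal sketch. The name-to-code mapping and the function name are hypothetical and exist only for illustration; they are not part of the application.

```python
# Hypothetical mapping from full stock names to stock codes (illustration only).
FULL_NAME_TO_CODE = {"Company A Holdings": "00700.HK"}


def match_by_full_name(text):
    """Return the codes of stocks whose full name appears verbatim in the text."""
    return [code for name, code in FULL_NAME_TO_CODE.items() if name in text]


# The full name appears, so the stock is linked.
linked = match_by_full_name("Company A Holdings reported quarterly results.")
# Only the short name appears, so full-name matching finds nothing.
missed = match_by_full_name("Company A reported quarterly results.")
```

  • The second call returns an empty list even though the text clearly concerns the same company, which is the gap the entity-based association is meant to close.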
  • In the embodiment of the present application, information can be automatically distributed to the individual stock information lists after system review and intelligent association of stocks.
  • the ability to associate individual stocks can be greatly improved, the efficiency of information distribution can be improved, and the user's demand for timely browsing of individual stock related information can be met.
  • Compared with matching individual stock names in full, the embodiment of the present application has been further optimized.
  • Representative entities other than company names that appear in the information, such as company executives, company products, company businesses, and the industry in which the company operates, can also be associated with individual stocks.
  • New association links can also be manually configured in the background operation system or existing association links can be activated to improve the accuracy of the association between information and individual stocks.
  • FIG. 1 is a flow chart of the information association method provided in the embodiment of the present application.
  • FIG. 2 is a schematic diagram of the application scenario provided in the embodiment of the present application.
  • the information association method of the embodiment of the present application can be applied to a server. The method comprises the following steps:
  • Step 110: extracting entity information of each piece of information in the information set, wherein the entity information includes at least one entity.
  • Before extracting entity information of each piece of information in the information set, the method further includes:
  • an information set is obtained, where the information set includes all the information in all the individual stock information lists.
  • each stock has accumulated a certain amount of stock information news, and the stock information news is provided to users for browsing on the relevant client.
  • the entity extraction result obtained includes the entity information of each piece of information in the information set, and an information-entity network can be preliminarily constructed.
  • a scheduled task can be set to obtain the newly added stock information within the time period from the Kafka message queue at multiple time points within a day, and perform entity extraction in batches.
  • the entity extraction results are also stored in the database.
  • Kafka is a distributed message queue with high performance, persistence, multi-copy backup, and horizontal expansion capabilities. It plays the role of decoupling, peak shaving, and asynchronous processing in the architecture. The biggest feature of Kafka is that it can process large amounts of data in real time to meet various demand scenarios.
  • Information may be acquired by crawling a content source, automatically reviewing the result and storing it in the database, or by manually creating the information and storing it in the database. For example, based on a copyright cooperation with the original website of the information, a crawler tool crawls information from the information source and stores it in the database.
  • the crawler tool is a program or script that automatically crawls World Wide Web information according to certain rules.
  • The crawler tool initiates a request to the target site through an HTTP library, that is, it sends a Request, which can contain additional headers and other information, and waits for the server to respond. If the server responds normally, a Response is obtained, whose content is the content of the page to be fetched; it may be HTML, a JSON string, binary data (such as pictures and videos), etc. HTML content can be parsed using regular expressions and web page parsing libraries; JSON content can be converted directly into a JSON object for parsing; binary data can be saved or further processed. The information crawled by the crawler tool can be saved as text, saved in a database, or saved in a file of a specific format. The information may also be manually created and stored in the database: in response to a storage request for the information sent by the manual review platform, the information is obtained and stored in the database.
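  • The response handling described above (HTML parsed with regular expressions, JSON converted to an object, binary data kept as-is) can be sketched as follows. This is an illustrative sketch using only the Python standard library; the function name, the content-type checks and the choice to extract only the page title are assumptions, not the crawler used in the application.

```python
import json
import re


def parse_response(content_type, body):
    """Dispatch on the response type: JSON, HTML, or binary data.

    `content_type` is the value of the HTTP Content-Type header and `body`
    is the raw response body as bytes (both assumed inputs).
    """
    if "json" in content_type:
        # JSON can be converted directly into an object.
        return json.loads(body.decode("utf-8"))
    if "html" in content_type:
        # HTML is parsed here with a regular expression; only the title is
        # extracted to keep the sketch short.
        html = body.decode("utf-8")
        match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        return {"title": match.group(1).strip() if match else None}
    # Anything else is treated as binary data and returned unchanged so that
    # the caller can save or further process it.
    return body
```

  • A real crawler would obtain `content_type` and `body` from its HTTP library and would normally use a dedicated web page parsing library for HTML rather than a single regular expression.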
  • each piece of information can be reviewed based on the preset review rules, wherein the preset review rules at least include sensitive word matching and filtering rule verification; if the information does not hit the sensitive word and the information does not hit the filtering rule, the information is stored in the database. If the information hits the sensitive word and/or the information hits the filtering rule, a first prompt message indicating that the review has not passed can be generated, and the information and the first prompt message can be sent to the manual review platform to allow the user to determine whether to store the information.
  • the sensitive word library and the filtering rule word library are pre-selected word libraries.
  • the title, source, text and other fields of the information are matched. If a sensitive word or a filtering word is hit, it is automatically judged as unapproved. The operator needs to conduct manual review before deciding whether to store the information. For example, in response to a maintenance instruction for stored information, the corresponding information can be stored in the stock information list of the corresponding stock.
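  • A minimal sketch of the automatic review step follows: the title, source and text fields are matched against a sensitive word library and a filtering word library, and any hit marks the information as not approved and produces a prompt message for manual review. The word lists, field names and return structure are hypothetical placeholders.

```python
# Hypothetical, pre-selected word libraries (illustration only).
SENSITIVE_WORDS = {"forbidden-term"}
FILTER_WORDS = {"advertisement"}


def review(info):
    """Apply sensitive word matching and filtering rule verification."""
    # Match the title, source and text fields, as described in the text.
    fields = " ".join(str(info.get(k, "")) for k in ("title", "source", "text")).lower()
    hit_sensitive = sorted(w for w in SENSITIVE_WORDS if w in fields)
    hit_filter = sorted(w for w in FILTER_WORDS if w in fields)
    if not hit_sensitive and not hit_filter:
        # No hit: the information can be stored in the database.
        return {"approved": True, "hits": [], "prompt": None}
    # Any hit: generate a prompt message and hand over to manual review.
    hits = hit_sensitive + hit_filter
    return {
        "approved": False,
        "hits": hits,
        "prompt": "Review not passed; matched: " + ", ".join(hits),
    }
```

  • In this sketch the caller stores approved information directly and forwards unapproved information, together with the prompt message, to the manual review platform.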
  • extracting entity information of each piece of information in the information set includes:
  • the information text data of each piece of information is processed based on an entity extraction model to obtain entity information of each piece of information, wherein the entity extraction model is used to extract preset entities in the information text data.
  • The entity extraction model exposes a callable interface, through which the model can be called directly to extract the entity information of each piece of information in the information set.
  • the core of information is to report news events, which describe events that occur in one or more entities (such as companies, people, countries, organizations, etc.).
  • the entity extraction model needs to accurately extract the preset entities that appear in the information, including but not limited to: company, person name, product name, business, industry, country, organization, etc.
  • For example, information a includes the content "Li, founder and CEO of A Automobile Company, officially announced that the XX type car will be released on X month X day".
  • The entity information of information a extracted by the entity extraction model may include the following entities: the company name "A Automobile Company", the person name "Li", and the product name "XX type car".
  • the system determines the correlation between each entity and all stocks, and then the core algorithm combines the correlation between all entities and stocks to get the final correlation result. Therefore, the entity extraction module is the first step in the overall architecture of associating stocks.
  • the entity extraction model mainly performs data labeling on entity types such as companies, names of people, product names, businesses, industries, countries, and organizations that are of concern to the financial field.
  • The Bert model is built from an embedding layer and 12 transformer layers, with a total of 110 million parameters, so the model is very large.
  • The Bert model is used as a text_encoder to extract features from the input information text data.
  • The input information text data first passes through the Tokenizer in Bert to obtain a token sequence of length L.
  • Each token in the sequence is then converted into a word id according to the mapping in vocab, yielding an input tensor of shape [1,L], which is input into the Bert model.
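  • The text-to-tensor path described above (text, then tokens, then word ids, then a [1,L] input) can be sketched with a toy example. The vocabulary and the whitespace tokenizer below are stand-ins chosen for illustration; the actual Bert Tokenizer uses its own vocabulary and subword splitting.

```python
# Toy vocabulary standing in for Bert's vocab file (hypothetical entries).
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "a": 4, "automobile": 5, "company": 6}


def encode(text):
    """Tokenize text and map each token to its word id.

    Returns the token sequence of length L and a nested list of shape [1, L]
    that plays the role of the input tensor.
    """
    # Whitespace splitting is a simplification of the real tokenizer.
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    return tokens, [ids]  # the outer list is the batch dimension of size 1


tokens, input_ids = encode("A Automobile Company announced")
```

  • Here the out-of-vocabulary word "announced" maps to the id of "[UNK]", and the batch dimension of size 1 gives the [1,L] shape mentioned in the text.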
  • the role of the full pointer layer is to use text_encoder to extract the rich semantic information of the entity, and to indicate the head and tail of the entity at one time through a pointer matrix, so as to quickly locate the position of the entity in the original text for direct extraction.
  • a simplified version of the Multi-Head Attention module can be used to implement this function.
  • the Multi-Head Attention module performs matrix calculations on three matrices Q (query), K (all keys), and V (values), and then performs Scaled Dot-Product Attention calculations.
  • S_α(i,j) represents the pointer matrix of the α-th entity category, and its shape is [L,L].
  • With n_labels entity categories, each category computes such a pointer matrix, so the output of the entire full pointer layer is a tensor of shape [n_labels,L,L].
  • The rows of S_α represent the entity head position and the columns represent the entity tail position. Therefore, although S_α is a square matrix, only the upper triangle has practical meaning, and the output of the lower triangle is ignored.
  • the function of the classification output layer is to extract entities from the output pointer matrix. Values greater than 0 in [n_labels,L,L] are considered to be activated entity heads and tails. Therefore, this layer converts the logits output by the model into a 0/1 binary matrix. The activated entity heads and tails are set to 1, and the rest are set to 0.
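  • The decoding performed by the classification output layer can be sketched as follows: logits greater than 0 become 1, everything else becomes 0, and only the upper triangle (head position not after tail position) is read to recover entity spans. Plain nested lists stand in for tensors, and the function names and label names are assumptions made for this sketch.

```python
def to_binary(logits):
    """Convert [n_labels, L, L] logits into a 0/1 matrix (1 where logit > 0)."""
    return [[[1 if value > 0 else 0 for value in row] for row in matrix]
            for matrix in logits]


def decode_entities(logits, tokens, labels):
    """Extract (label, head, tail, text) spans from the pointer matrices."""
    entities = []
    for label, matrix in zip(labels, to_binary(logits)):
        for head, row in enumerate(matrix):
            # Only the upper triangle is meaningful: tail >= head.
            for tail in range(head, len(row)):
                if row[tail] == 1:
                    text = " ".join(tokens[head:tail + 1])
                    entities.append((label, head, tail, text))
    return entities


tokens = ["Li", "founded", "A", "Automobile", "Company"]
labels = ["company", "person"]
size = len(tokens)
logits = [[[-1.0] * size for _ in range(size)] for _ in labels]
logits[0][2][4] = 3.2   # company span covering tokens 2..4
logits[1][0][0] = 2.1   # person span covering token 0
logits[0][4][2] = 5.0   # lower triangle: ignored by the decoder
found = decode_entities(logits, tokens, labels)
```

  • The positive value placed in the lower triangle is skipped, which mirrors the statement that the lower-triangle output is ignored.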
  • the information text data of the sample information obtained is manually annotated, for example, by annotating the entity types such as company, person name, product name, business, industry, country, organization, etc. in the information text data of the sample information to obtain the annotated data.
  • the annotated data is input into the preset algorithm model for model training, such as using the annotated data and the Bert model to pre-train multiple transformer layers, global pointer layers, and classification output layers to obtain an entity extraction model.
  • the information text data of each piece of information in the information set is input into the entity extraction model to extract the entity information of each piece of information.
  • Step 120: determining global statistical relationship information based on the stock set, the information set, and the entity information of each piece of information in the information set.
  • step 120 can be implemented based on the Spark computing task system.
  • Spark is a big data parallel computing framework based on memory computing. Spark improves the real-time performance of data processing in a big data environment based on the characteristics of memory computing, while ensuring high fault tolerance and high scalability, allowing users to deploy Spark on a large number of cheap hardware to form a cluster, thereby improving parallel computing capabilities.
  • the Spark computing task system can be responsible for converting and aggregating individual stocks, massive amounts of individual stock information, and tens of millions of entities extracted from them, based on the MapReduce concept, to obtain global statistical relationship information between objects of different categories.
  • the global statistical relationship information can represent the relationship between individual stocks and entities, the relationship between entities, etc.
  • MapReduce is a programming model used for parallel computing of large-scale data sets (greater than 1TB).
  • The global statistical relationship information includes the first co-occurrence relationship between different entities and different stocks, the second co-occurrence relationship between different entities, and the first correlation between different entities and different stocks. For example, the global statistical relationship information may include: the first co-occurrence relationship between stocks and entities, the second co-occurrence relationship between entities, the global IDF (Inverse Document Frequency) value of entities, the MF (Mention Frequency, i.e., entity frequency) value of stocks and entities, the MFIDF (entity frequency-inverse document frequency) value, etc.
  • step 120 may be implemented by steps 121 to 123 (not shown in the figure), specifically:
  • Step 121: determining the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set according to the stock set, the information set and the entity information of each piece of information in the information set.
  • Determining the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set based on the stock set, the information set, and the entity information of each piece of information in the information set includes:
  • according to the individual stock information list corresponding to each stock in the stock set, and the information set, determining the stocks that have an initial association relationship with each piece of information in the information set;
  • the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set is determined.
  • For the first co-occurrence relationship between stocks and entities: for information a in the information set, it is determined from the stock information list and the information set that information a is associated with n stocks, and m entities can be extracted from information a. Information a therefore generates n × m "stock-entity" pairs.
  • When a stock and an entity are paired (that is, the two appear together), they are considered to co-occur once. The frequency of each "stock-entity" pair is therefore incremented by 1, and finally the total number of global co-occurrences of a given stock and a given entity is obtained.
  • All "stock-entity" pairs are traversed to obtain the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set.
  • the first co-occurrence relationship can represent the degree of correlation between entities and stocks.
  • the first co-occurrence relationship is helpful for users to understand which entities are most relevant to a certain stock. For example, the stock 00700.HK often co-occurs with the entity "King of Glory", indicating that the correlation between the two is high.
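  • The counting in step 121 can be sketched in a few lines: each piece of information contributes one count to every "stock-entity" pair formed from its initially associated stocks and its extracted entities. The record layout (`stocks`, `entities`) and the stock code "STOCK_B" are hypothetical, and in the application this aggregation runs as a Spark task rather than in-process Python.

```python
from collections import Counter
from itertools import product


def stock_entity_cooccurrence(items):
    """Count global co-occurrences of (stock, entity) pairs.

    Each item is a dict with the initially associated stocks and the
    extracted entities of one piece of information.
    """
    counts = Counter()
    for item in items:
        # n stocks and m entities generate n x m pairs; sets avoid counting
        # the same pair twice within one piece of information.
        for pair in product(set(item["stocks"]), set(item["entities"])):
            counts[pair] += 1
    return counts


items = [
    {"stocks": ["00700.HK"], "entities": ["King of Glory", "WeChat"]},
    {"stocks": ["00700.HK", "STOCK_B"], "entities": ["King of Glory"]},
]
first_cooccurrence = stock_entity_cooccurrence(items)
```

  • Here the pair ("00700.HK", "King of Glory") is counted twice because it appears in both pieces of information.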
  • Step 122: determining a second co-occurrence relationship between entities in each piece of information based on the entity information of each piece of information in the information set.
  • For the second co-occurrence relationship between entities: similarly, for information a in the information set, m entities can be extracted from information a, and the entities in information a are paired with each other to form m × m "entity-entity" pairs. When two entities are paired (that is, the two appear together), they are considered to co-occur once. Finally, the total number of global co-occurrences of a given entity with another entity is obtained. All "entity-entity" pairs are traversed to obtain the second co-occurrence relationship between the entities in each piece of information. The second co-occurrence relationship represents the degree of correlation between entities. For example, the entity "WeChat" often co-occurs with the entity "official account", indicating that the correlation between the two is high.
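  • A corresponding sketch for step 122 pairs the entities of each piece of information with each other. It follows the m × m pairing in the text but skips pairing an entity with itself, which is an assumption of this sketch; the record layout is hypothetical.

```python
from collections import Counter
from itertools import product


def entity_entity_cooccurrence(items):
    """Count global co-occurrences of ordered (entity, entity) pairs."""
    counts = Counter()
    for item in items:
        entities = set(item["entities"])
        for left, right in product(entities, repeat=2):
            if left != right:  # assumption: self-pairs carry no information
                counts[(left, right)] += 1
    return counts


second_cooccurrence = entity_entity_cooccurrence([
    {"entities": ["WeChat", "official account"]},
    {"entities": ["WeChat", "official account", "Li"]},
])
```

  • Because the pairs are ordered, each relationship is stored in both directions, so a lookup can start from either entity.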
  • Step 123: based on the stock set, the information set and the entity information of each piece of information in the information set, determining a first correlation between each stock in the stock set and each entity in each piece of information.
  • Determining the first correlation between each stock in the stock set and each entity in each piece of information based on the stock set, the information set, and the entity information of each piece of information in the information set includes:
  • according to the stock set, the information set and the entity information of each piece of information in the information set, determining the number of pieces of information in which the i-th entity and the j-th stock in the stock set co-occur, the total number of pieces of information corresponding to the j-th stock, the total number of pieces of information in the information set, and the total number of pieces of information in the information set in which the i-th entity appears;
  • Each entity in each piece of information in the information set is traversed to determine a first correlation between each stock in the stock set and each entity in each piece of information.
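  • The entity frequency MF_ij of the j-th stock relative to the i-th entity can be determined, which can be expressed as the following formula (1):

$$\mathrm{MF}_{ij} = \frac{\text{number of pieces of information in which the } i\text{-th entity and the } j\text{-th stock co-occur}}{\text{total number of pieces of information corresponding to the } j\text{-th stock}} \tag{1}$$

  • The inverse document frequency IDF_i of the i-th entity is determined, which can be expressed as the following formula (2):

$$\mathrm{IDF}_{i} = \log\frac{\text{total number of pieces of information in the information set}}{\text{total number of pieces of information in which the } i\text{-th entity appears} + 1} \tag{2}$$

  • In formula (2), 1 is added to the total number of pieces of information in which entity i appears to avoid a zero denominator (i.e., the case where no information contains the entity); log denotes taking the logarithm of the resulting value.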
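  • The quantities above can be computed as in the following sketch. It assumes that MF is the co-occurrence count divided by the total number of pieces of information of the stock, that IDF is the logarithm of the total number of pieces of information divided by the entity's count plus 1, and that MFIDF is their product by analogy with TF-IDF. The natural logarithm and all function names are assumptions of this sketch.

```python
import math


def mention_frequency(cooccur_count, stock_info_total):
    """MF_ij: share of stock j's information in which entity i also appears."""
    return cooccur_count / stock_info_total


def inverse_document_frequency(info_total, entity_info_total):
    """IDF_i: log of total information count over (entity count + 1)."""
    # Adding 1 keeps the denominator non-zero when no information
    # contains the entity.
    return math.log(info_total / (entity_info_total + 1))


def mfidf(cooccur_count, stock_info_total, info_total, entity_info_total):
    """MFIDF_ij, assumed here to be MF_ij multiplied by IDF_i."""
    return (mention_frequency(cooccur_count, stock_info_total)
            * inverse_document_frequency(info_total, entity_info_total))


# Entity i co-occurs with stock j in 30 of the stock's 100 pieces of
# information and appears in 99 of 1000 pieces of information overall.
score = mfidf(30, 100, 1000, 99)
```

  • An entity that appears in most of a stock's information but rarely elsewhere receives a high score, while an entity that appears everywhere is damped by the IDF term.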
  • a Spark task result is obtained, which may include global statistical relationship information, and then the obtained Spark task result is directly written into the Hive table.
  • a Hive table is generated for all first co-occurrence relationships in the information set (including the first co-occurrence relationship between the stock and the entity corresponding to each piece of information); a Hive table is generated for all second co-occurrence relationships in the information set (including the second co-occurrence relationship between the entity and the entity corresponding to each piece of information); and multiple Hive tables may correspond to all first correlations in the stock set (including the first correlation between each stock and each entity in each piece of information), one of which is used to store MF values, another Hive table is used to store IDF values, and another Hive table is used to store MFIDF values.
  • the Hive table can be synchronized to the MySQL table to use the MySQL index to quickly query data.
  • the Hive table or the corresponding MySQL table storing the Spark task results may be recorded as a stock_mention table.
  • Step 130: determining the associated stocks corresponding to each piece of information in the information set according to the global statistical relationship information, wherein the associated stocks represent stocks related to the corresponding information.
  • determining the associated stocks corresponding to each piece of information in the information set according to the global statistical relationship information includes:
  • the target stock is determined as an associated stock of the target information
  • the target stock is any stock in the stock set, and the target information is any information in the information set.
  • the results stored in the stock_mention table can be used to determine the associated stocks corresponding to each piece of information in the information set.
  • The co-occurrence entity list corresponding to each stock in the stock set can be determined based on the first co-occurrence relationship and the second co-occurrence relationship in the stock_mention table. For example, given any stock A, all entities that have co-occurred with stock A can be returned to obtain the co-occurrence entity list of stock A; sorting this list in descending order of the MFIDF value, which represents the first correlation, and taking the top N gives a preliminary list of the entities most relevant to stock A. The most relevant entity list of each stock is traversed in this way to determine the candidate entities corresponding to each stock in the stock set.
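  • Candidate selection can be sketched as a filter, a descending sort by MFIDF, and a top-N cut. The row layout (`stock`, `mention`, `mfidf`) is a hypothetical stand-in for the stock_mention table, and the scores are made up for illustration.

```python
def top_n_entities(rows, stock, n):
    """Return the n entities most relevant to a stock, by descending MFIDF."""
    candidates = [row for row in rows if row["stock"] == stock]
    candidates.sort(key=lambda row: row["mfidf"], reverse=True)
    return [row["mention"] for row in candidates[:n]]


# Made-up rows standing in for the stock_mention table.
stock_mention = [
    {"stock": "00700.HK", "mention": "WeChat", "mfidf": 0.42},
    {"stock": "00700.HK", "mention": "King of Glory", "mfidf": 0.67},
    {"stock": "00700.HK", "mention": "market", "mfidf": 0.03},
    {"stock": "STOCK_B", "mention": "XX type car", "mfidf": 0.55},
]
candidates = top_n_entities(stock_mention, "00700.HK", 2)
```

  • In the described system this query would run against the indexed MySQL copy of the table; the in-memory list here only illustrates the ordering logic.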
  • a manual review platform can be added to check the most relevant entities of each stock (for example, more than 20,000 stocks) one by one.
  • the candidate entity corresponding to each stock in the stock set is sent to the manual review platform.
  • the association link identifier between each stock and the corresponding candidate entity is generated and returned to the server, so that the server obtains the association link identifier between each stock and the corresponding candidate entity.
  • the target stock is determined as the associated stock of the target information.
  • the embodiment of the present application can provide the ability to add one or more association links between individual stocks and entities, and can easily perform operations such as adding, deleting, activating, and deactivating on the manual review platform.
  • the modified association link can be reflected in the association results of the online service in a timely manner.
  • the manual operation table generated by the manual review platform operation is based on the basic data of the stock_mention table, and the manual operation record of the individual stock-entity pair is added, which is recorded as the ops_stock_mention table.
  • the ops_stock_mention table contains the association link identifier between each individual stock and the corresponding candidate entity.
  • The core of associating information with individual stocks lies in how to determine the correlation between a given entity and each stock in the stock set. That is, a mechanism or algorithm must be designed so that the multiple entities extracted from a piece of information have the highest correlation with the right stock; for example, the entity group "A Automobile Company", "Li", and "XX Model Automobile" should have the highest correlation with stock b and a relatively low correlation with other stocks.
  • the entities extracted from the information are associated with individual stocks, thereby achieving the purpose of associating information with individual stocks.
  • the manually activated entity-individual stock link obtained in step 130 can be used for direct association.
  • the corresponding target individual stocks are directly returned as associated stocks.
  • If the association link identifiers between the target stock and the corresponding candidate entities meet the preset conditions, and the candidate entities corresponding to the target stock all belong to the target information, the target stock is determined as the associated stock of the target information.
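  • One possible reading of this rule is sketched below: a stock is associated with a piece of information when it has an activated link and every candidate entity of that link appears among the entities extracted from the information. Treating "meets the preset conditions" as `active == 1`, and the link layout itself, are assumptions of this sketch.

```python
def associated_stocks(info_entities, links):
    """Return stocks whose activated link entities all occur in the information.

    `links` is a list of dicts with keys "stock", "mentions" and "active",
    a hypothetical stand-in for rows of the ops_stock_mention table.
    """
    present = set(info_entities)
    result = []
    for link in links:
        mentions = set(link["mentions"])
        if (link["active"] == 1 and mentions and mentions <= present
                and link["stock"] not in result):
            result.append(link["stock"])
    return result


links = [
    {"stock": "STOCK_B", "mentions": ["A Automobile Company", "Li"], "active": 1},
    {"stock": "STOCK_C", "mentions": ["XX type car"], "active": 0},
    {"stock": "STOCK_D", "mentions": ["Li", "Another Company"], "active": 1},
]
info_entities = ["A Automobile Company", "Li", "XX type car"]
matched = associated_stocks(info_entities, links)
```

  • STOCK_C is skipped because its link is deactivated, and STOCK_D is skipped because one of its candidate entities does not appear in the information.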
  • An information set is obtained from the information database; entity extraction is then performed on the information set to extract the entity information of each piece of information, and the entity information is stored in Table 1; the Spark scheduled processing program of the Spark computing task system then computes the Spark task result, which is written into Table 2, a Hive table.
  • Table 3 is a MySQL table and Table 2 is a Hive table. Since Hive does not support indexes, its queries are slow and unsuitable for online use, so the Spark task result in Table 2 needs to be synchronized to MySQL for storage. The data content of Table 2 and Table 3 is therefore the same; only the storage medium differs: one is stored in Hive and the other in MySQL.
  • the native Spark task result can well support writing to Hive, but it is not easy to write to MySQL, so this data synchronization operation is required to synchronize the data of Table 2 to Table 3.
  • the "target stocks stored in the database” in Figure 2 means that when the system is initially launched, only the Spark task results corresponding to the popular stocks (target stocks) are stored in Table 3 (MySQL table); when the system is subsequently launched, all the data of all stocks can be synchronized to Table 3 (MySQL table).
  • the operation table stores the associated results retained after manual review of the data in Table 3. Since the number of data rows involved in Table 3 is very large, at the level of tens of millions, it is not suitable for real-time synchronization. Therefore, the data in Table 3 can be processed regularly and the processing results can be updated and merged into the operation table.
  • all fields in the operation table are from Table 3, and an active field is added to the operation table compared to Table 3.
  • the operation staff may have modified the operation table, such as setting the active of some rows to 1 or 0, or adding rows, and the manual review results determine that a certain entity (mention) needs to be associated with a certain stock; 2) For Table 3, new data rows may also be added during this period, the number of co-occurrences may increase, etc. 3)
  • all updated data in Table 3 are merged into the operation table.
  • the operation view of different stocks can also be presented in real time, based on the operation staff modifying or adding operations to the operation table through the entity management background, and updating the data corresponding to the modification or addition operations to the operation table.
  • the associated stocks corresponding to each piece of information in the information set can be determined, and the associated stocks corresponding to different pieces of information can be presented on the user interface of the client.
  • the method further includes: displaying the associated stocks.
  • the associated stocks are multiple associated stocks
  • the multiple associated stocks can be displayed on the user interface of the client in a preset sorting manner.
  • the preset sorting may include sorting the release time of the multiple associated stocks, sorting the number of user positions corresponding to the multiple associated stocks, sorting the user click rate corresponding to the multiple associated stocks, etc.
  • the interface protocol of the service can be formulated in cooperation with the developers on the business side as follows:
  • each associated stock in the list of associated stocks needs to include the stock ID, stock code, stock Chinese abbreviation, stock-related entity list, etc.
  • Each entity also needs to include results such as stock-entity correlation.
  • the information association method provided in the embodiment of the present application extracts the entity information of each piece of information in the information set, the entity information includes at least one entity, and then determines the global statistical relationship information based on the stock set, the information set and the entity information of each piece of information in the information set, and then determines the associated stocks corresponding to each piece of information in the information set based on the global statistical relationship information.
  • the associated stocks represent the stocks related to the corresponding information, which can improve the accuracy of the association between information and stocks and improve the efficiency of information distribution.
  • the embodiment of the present application optimizes the problems of false association and missed association caused by the individual stock association method with a complete match of the stock name, and can more accurately determine the degree of association between individual stocks and information to improve the accuracy of the association between information and individual stocks.
  • the embodiment of the present application also provides a client.
  • Figure 3 is a schematic diagram of the structure of the information association device provided by the embodiment of the present application.
  • the information association device 200 may include:
  • An extraction unit 210 configured to extract entity information of each piece of information in the information set, wherein the entity information includes at least one entity;
  • a first determining unit 220 configured to determine global statistical relationship information based on the stock set, the information set, and entity information of each piece of information in the information set;
  • the second determining unit 230 is used to determine the associated stocks corresponding to each piece of information in the information set according to the global statistical relationship information, and the associated stocks represent stocks related to the corresponding information.
  • the first determining unit 220 is configured to:
  • a first correlation degree between each individual stock in the individual stock set and each entity in each piece of information is determined.
  • the extraction unit 210 is further configured to:
  • an information set is obtained, where the information set includes all the information in all the individual stock information lists.
  • the first determining unit 220 when determining the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set based on the stock set, the information set, and the entity information of each piece of information in the information set, the first determining unit 220 is used to:
  • the individual stock information list corresponding to each individual stock in the individual stock set and the information set determining the individual stock with an initial association relationship corresponding to each piece of information in the information set;
  • the first co-occurrence relationship between each entity in each piece of information and each individual stock in the individual stock set is determined.
  • the first determining unit 220 when determining the first degree of association between each stock in the stock set and each entity in each piece of information based on the stock set, the information set, and the entity information of each piece of information in the information set, the first determining unit 220 is used to:
  • the information set and the entity information of each piece of information in the information set determine the number of co-occurrence pieces of information of the i-th entity in each piece of information and the j-th stock in the individual stock set, determine the total number of pieces of information corresponding to the j-th stock, determine the total number of pieces of information in the information set, and determine the total number of pieces of information in which the i-th entity appears in the information set;
  • Each entity in each piece of information in the information set is traversed to determine a first correlation between each stock in the stock set and each entity in each piece of information.
  • the second determining unit 230 is configured to:
  • the target stock is determined as an associated stock of the target information
  • the target stock is any stock in the stock set, and the target information is any information in the information set.
  • the extraction unit 210 is used to:
  • the information text data of each piece of information is processed based on an entity extraction model to obtain entity information of each piece of information, wherein the entity extraction model is used to extract preset entities in the information text data.
  • the information association device embodiment and the method embodiment can correspond to each other, and similar descriptions can refer to the method embodiment. To avoid repetition, no further description is given here.
  • the information association device shown in the figure can execute the above-mentioned information association method embodiment, and the aforementioned and other operations and/or functions of each unit in the information association device respectively implement the corresponding processes of the above-mentioned method embodiment, which will not be repeated here for the sake of brevity.
  • the present application further provides a computer device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps in the above-mentioned method embodiments when executing the computer program.
  • FIG4 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application, and the computer device may be a terminal or a server.
  • the computer device 300 may include: a communication interface 301, a memory 302, a processor 303 and a communication bus 304.
  • the communication interface 301, the memory 302, and the processor 303 communicate with each other through the communication bus 304.
  • the communication interface 301 is used for the computer device 300 to communicate data with an external device.
  • the memory 302 may be used to store software programs and modules, and the processor 303 runs the software programs and modules stored in the memory 302, such as the software programs of the corresponding operations in the aforementioned method embodiment.
  • the processor 303 may call the software program and module stored in the memory 302 to perform the following operations:
  • Extract entity information of each piece of information in the information set wherein the entity information includes at least one entity; determine global statistical relationship information based on the stock set, the information set, and the entity information of each piece of information in the information set; determine the associated stocks corresponding to each piece of information in the information set based on the global statistical relationship information, wherein the associated stocks represent stocks related to the corresponding information.
  • the present application embodiment provides a computer-readable storage medium, in which multiple computer programs are stored, and the computer program can be loaded by a processor to execute the steps in any one of the information association methods provided in the present application embodiment.
  • the specific implementation of each of the above operations can be referred to the previous embodiments, and will not be repeated here.
  • the storage medium may include: Read Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disc, etc.
  • because the computer program stored in the storage medium can execute the steps in any one of the information association methods provided in the embodiments of the present application, it can achieve the beneficial effects of any one of those methods. Please refer to the previous embodiments for details, which are not repeated here.
  • the embodiment of the present application also provides a computer program product, which includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the corresponding process in any one of the information association methods in the embodiment of the present application, which will not be described here for the sake of brevity.
  • the embodiment of the present application also provides a computer program, which includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the corresponding process in any one of the information association methods in the embodiment of the present application, which will not be described here for the sake of brevity.
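
The periodic merge of Table 3 into the operation table mentioned in the excerpts above can be illustrated with a minimal sketch. The excerpts only state that the operation table has every field of Table 3 plus an `active` field, and that updated Table 3 data is merged in. The row key `(entity, stock)`, the field name `cooccur`, the default `active = 0`, and the choice to preserve a manually set `active` value are all assumptions made for illustration.

```python
def merge_into_operation_table(operation: dict, table3: dict) -> dict:
    """Merge Table 3 rows into the operation table in place.

    Both tables are modeled as dicts keyed by (entity, stock); each value is
    a dict of fields. This in-memory model is hypothetical.
    """
    for key, row in table3.items():
        merged = dict(row)  # copy the latest statistics from Table 3
        # Assumption: keep the manual flag if the row already exists,
        # otherwise start the new row as inactive.
        merged["active"] = operation.get(key, {}).get("active", 0)
        operation[key] = merged
    # Rows that staff added manually and that are absent from Table 3
    # are left untouched.
    return operation


operation_table = {
    ("CEO X", "S1"): {"cooccur": 5, "active": 1},    # activated by staff
    ("Brand Z", "S2"): {"cooccur": 0, "active": 1},  # added manually
}
table_3 = {
    ("CEO X", "S1"): {"cooccur": 9},      # count increased since last merge
    ("Product Y", "S1"): {"cooccur": 3},  # new row
}
merge_into_operation_table(operation_table, table_3)
```

Keying on the entity and stock pair keeps the merge idempotent: running it twice with the same Table 3 data leaves the operation table unchanged.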

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Operations Research (AREA)
  • Mathematical Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Algebra (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Human Resources & Organizations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

Disclosed in the present application are an information association method and apparatus, a device, and a storage medium. The method comprises: extracting entity information from each piece of information in an information set, the entity information comprising at least one entity; then, according to an individual stock set, the information set, and the entity information of each piece of information in the information set, determining global statistical relationship information; and then, according to the global statistical relationship information, determining an associated individual stock corresponding to each piece of information in the information set, the associated individual stock representing an individual stock associated with the corresponding information. The present application may improve the accuracy of association between information and individual stocks and improve information distribution efficiency.

Description

Information association method, apparatus, device, and storage medium
Priority information
This disclosure claims priority to Chinese patent application No. 202211660198.7, filed on December 21, 2022 and entitled "Information association method, apparatus, device, and storage medium", the entire contents of which are incorporated into this disclosure by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to an information association method, apparatus, device, and storage medium.
Background
An individual stock is a security with no repayment term. By holder, stocks can be divided into three types: state-owned shares, legal-person shares, and individual shares. Individual shares (individual stock) are funded by individuals and can be freely listed and traded.
A service that associates information with individual stocks is crucial. From the perspective of operational distribution, once a piece of financial information has been stored in the database, distributing it requires deciding which individual stocks it is most relevant to and placing it under the information lists of those stocks, so that users who follow those stocks receive the most timely and relevant information. From the perspective of user experience, users reading a piece of information need to quickly identify which stocks it relates to and be able to go directly from the information page to the trading quotation page of an individual stock, which helps them make more accurate investment decisions and place orders more quickly.
Currently, the information side associates individual stocks by exact matching of stock names. This approach matches only when the stock name appears in full, so distribution efficiency is low and incorrect associations are common.
Summary of the Invention
The embodiments of the present application provide an information association method, apparatus, device, and storage medium, which can improve the accuracy of the association between information and individual stocks and improve the efficiency of information distribution.
In one aspect, an embodiment of the present application provides an information association method, the method comprising:
extracting entity information of each piece of information in an information set, wherein the entity information includes at least one entity;
determining global statistical relationship information based on an individual stock set, the information set, and the entity information of each piece of information in the information set; and
determining, according to the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set, wherein the associated stocks represent stocks related to the corresponding information.
In another aspect, an embodiment of the present application provides an information association device, the device comprising:
an extraction unit, configured to extract entity information of each piece of information in the information set, wherein the entity information includes at least one entity;
a first determining unit, configured to determine global statistical relationship information based on the individual stock set, the information set, and the entity information of each piece of information in the information set; and
a second determining unit, configured to determine, according to the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set, wherein the associated stocks represent stocks related to the corresponding information.
In another aspect, an embodiment of the present application provides a computer device, which includes a processor and a memory, wherein a computer program is stored in the memory, and the processor executes the information association method described in any of the above embodiments by calling the computer program stored in the memory.
In another aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program is suitable for being loaded by a processor to execute the information association method described in any of the above embodiments.
The embodiments of the present application extract the entity information of each piece of information in the information set, where the entity information includes at least one entity; determine global statistical relationship information based on the individual stock set, the information set, and the entity information of each piece of information in the information set; and then determine, according to the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set, where the associated stocks represent stocks related to the corresponding information. This can improve the accuracy of the association between information and individual stocks and improve the efficiency of information distribution.
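
The three steps can be illustrated with a minimal sketch. The apparatus description names four counts (the co-occurrence count of an entity and a stock, the total per stock, the total per entity, and the size of the information set) but this section does not give the scoring formula. The sketch therefore only gathers those counts and, purely as an assumption, associates a stock when some entity in the information has co-occurred with it at least `min_count` times. All function and parameter names are hypothetical.

```python
from collections import Counter


def global_counts(initial_stocks: dict[str, set[str]],
                  entities_by_article: dict[str, list[str]]):
    """Gather co-occurrence statistics over the whole information set."""
    cooccur = Counter()     # (entity, stock) -> articles containing both
    per_stock = Counter()   # stock -> articles initially linked to it
    per_entity = Counter()  # entity -> articles mentioning it
    for article, entities in entities_by_article.items():
        unique = set(entities)  # count each entity once per article
        for entity in unique:
            per_entity[entity] += 1
        for stock in initial_stocks.get(article, ()):
            per_stock[stock] += 1
            for entity in unique:
                cooccur[(entity, stock)] += 1
    return cooccur, per_stock, per_entity, len(entities_by_article)


def associate(entities: set[str], cooccur: Counter, min_count: int = 2) -> list[str]:
    """Assumed rule: keep stocks that co-occur often enough with any entity."""
    return sorted({stock for (entity, stock), n in cooccur.items()
                   if entity in entities and n >= min_count})


initial = {"a1": {"S1"}, "a2": {"S1"}, "a3": {"S2"}}
extracted = {"a1": ["CEO X", "Product Y"], "a2": ["CEO X"], "a3": ["Product Y"]}
cooccur, per_stock, per_entity, total = global_counts(initial, extracted)
```

A raw count threshold favors frequently covered stocks; a normalized score built from the same four counts would be a natural refinement, but its form is not specified in this section.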
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the information association method provided in an embodiment of the present application.
FIG. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application.
FIG. 3 is a schematic diagram of the structure of the information association device provided in an embodiment of the present application.
FIG. 4 is a schematic diagram of the structure of a server provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art from the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The embodiments of the present application provide an information association method, apparatus, terminal device, and storage medium. Specifically, the information association method may be executed by a computer device, which may be a terminal, a server, or a similar device. The terminal may be a smartphone, tablet computer, laptop computer, desktop computer, smart TV, smart speaker, wearable smart device, smart in-vehicle terminal, or similar device, and may also include a client, such as a financial client, a browser client, or an instant messaging client. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content distribution network services, and big data and artificial intelligence platforms, but is not limited to these.
Currently, the information side associates individual stocks by exact matching of stock names. This approach matches only when the stock name appears in full, so distribution efficiency is low and incorrect associations are common. Exact matching of stock names has the following disadvantages:
1. A piece of information can be associated with a stock only when the stock name appears in full. For example, when "Company A Holdings" appears in the information, it can be associated with the stock 00700.HK, but when only "Company A" appears, it cannot.
2. When the stock name does not appear directly but representative entities such as company products or executives' names are mentioned frequently, the information cannot be associated with the stock; the approach has no ability to reason over related information.
3. There is no contextual semantic understanding, so ambiguous stock names cause incorrect associations. For example, the stock 002291.SZ is named "Saturday"; if the information contains "talks will be held on Saturday", it is wrongly associated with that stock.
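
The first and third shortcomings can be reproduced with a few lines of code. This is a sketch of the exact-match baseline being criticized, not of the method proposed here; the name table is hypothetical and only mirrors the two examples given above.

```python
# Hypothetical name table built from the examples in the text.
STOCK_NAMES = {
    "00700.HK": "Company A Holdings",
    "002291.SZ": "Saturday",
}


def exact_match_stocks(text: str) -> list[str]:
    """Return the codes of stocks whose full name appears verbatim in the text."""
    return [code for code, name in STOCK_NAMES.items() if name in text]


# Full name present: the stock is found.
print(exact_match_stocks("Company A Holdings released its annual report."))
# Only the short name present: the stock is missed.
print(exact_match_stocks("Company A released a new product."))
# Ambiguous name used as a weekday: a false association.
print(exact_match_stocks("Talks will be held on Saturday."))
```

Substring matching has no notion of what kind of entity a phrase refers to, which is why the weekday in the last call is treated as a stock name.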
Therefore, a more intelligent information association method is needed. In the embodiments of the present application, after a piece of information is stored in the database, it can be reviewed by the system and intelligently associated with stocks, so that it is automatically distributed to the corresponding individual stock information lists. This can substantially improve the ability to associate individual stocks, improve the efficiency of information distribution, and meet users' need to browse stock-related information in a timely manner. Compared with exact matching of stock names, the embodiments are further optimized: for example, representative entities other than company names, such as company executives, company products, company businesses, and the industry in which the company operates, can also be associated with individual stocks, and new association links can be manually configured, or existing ones activated, in the back-office operation system to improve the accuracy of the association between information and individual stocks.
Detailed descriptions are given below. It should be noted that the order in which the following embodiments are described does not limit their order of priority.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a flow chart of the information association method provided in an embodiment of the present application, and FIG. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application. The information association method of the embodiment can be applied to a server and includes the following steps:
Step 110: extract entity information of each piece of information in the information set, wherein the entity information includes at least one entity.
In some embodiments, before extracting the entity information of each piece of information in the information set, the method further includes:
obtaining the individual stock information list corresponding to each individual stock in the individual stock set, wherein each individual stock information list stores at least one piece of information having an initial association relationship; and
obtaining the information set according to the individual stock information lists, wherein the information set includes all the information in all the individual stock information lists.
For example, before the system corresponding to the information association method is built, each individual stock has already accumulated a certain amount of stock news, which the relevant client presents to users for browsing; for some popular key stocks, dedicated operations staff also maintain the information daily. All the information stored in the information list of an individual stock therefore has some correlation with that stock (strong or weak, but not entirely unrelated). The large volume of information already present in these lists can thus be used as prior knowledge to build the information set. After entity extraction is performed on each piece of information in the information set, the extraction result contains the entity information of each piece of information, from which an information-entity network can be preliminarily constructed.
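
Building the information set from the per-stock lists can be sketched as follows. The input shape (a mapping from stock code to article identifiers) and all names are assumptions for illustration; the text only specifies that the information set contains every piece of information from every list, and that membership in a list constitutes the initial association.

```python
from collections import defaultdict


def build_information_set(stock_info_lists: dict[str, list[str]]):
    """Union all per-stock lists and record each article's initial stocks.

    stock_info_lists maps a stock code to the ids of the articles stored
    in that stock's information list (hypothetical representation).
    """
    info_set = set()
    initial_stocks = defaultdict(set)  # article id -> initially linked stocks
    for stock, article_ids in stock_info_lists.items():
        for article_id in article_ids:
            info_set.add(article_id)
            initial_stocks[article_id].add(stock)
    return info_set, dict(initial_stocks)


lists = {"S1": ["a1", "a2"], "S2": ["a2", "a3"]}
info_set, initial_stocks = build_information_set(lists)
```

An article that appears in several lists is stored once in the information set but keeps every stock it was initially linked to, which is what later co-occurrence statistics would rely on.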
For example, in actual online use, extracting the entity information of each piece of information in the information set falls into two cases:
1) For the existing stock information in the information set, multiple threads can be started at once to call the entity extraction interface concurrently, so that the entity extraction results for millions of existing pieces of stock information are stored in the database within a short time.
2) For the incremental stock information that keeps flowing into the information set after launch, a scheduled task can be set to fetch, at several points during the day, the stock information newly added in the preceding period from the Kafka message queue and perform entity extraction in batches; the extraction results are likewise stored in the database.
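
The two cases above can be sketched with the standard library. The `extract_entities` function is a stand-in for the entity extraction interface, which is not specified here; the in-memory `store` stands in for the database, and the scheduler and message-queue consumer that would feed `process_increment` are deliberately left out. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_entities(text: str) -> list[str]:
    """Placeholder for the extraction interface: returns capitalized tokens."""
    return [tok.strip(".,") for tok in text.split() if tok[:1].isupper()]


def extract_backlog(articles: dict[str, str], max_workers: int = 8) -> dict[str, list[str]]:
    """Case 1: process the existing backlog with concurrent requests."""
    ids = list(articles)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with ids.
        results = pool.map(extract_entities, (articles[i] for i in ids))
        return dict(zip(ids, results))


def process_increment(batch: dict[str, str], store: dict[str, list[str]]) -> None:
    """Case 2: handle one batch of newly arrived articles and store the results."""
    for article_id, text in batch.items():
        store[article_id] = extract_entities(text)


store = extract_backlog({"a1": "Company A released Product X."})
process_increment({"a2": "Shares of Company B rose."}, store)
```

Threads suit the backlog case when the extraction interface is a network call, since the workers spend most of their time waiting on responses.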
Kafka is a distributed message queue offering high performance, persistence, multi-replica backup, and horizontal scalability. In the architecture it provides decoupling, peak shaving, and asynchronous processing. Its most notable feature is that it can process large volumes of data in real time to meet a variety of scenarios.
Obtaining information.
For example, a piece of information may be acquired either by crawling a content source, with automatic review before storage, or by manual creation. In the first case, information is crawled from an information source and stored in the database; for example, on the basis of a copyright agreement with the original information website, a crawler tool crawls information from the source. A crawler tool is a program or script that automatically fetches World Wide Web content according to certain rules. The crawler sends a Request to the target site through an HTTP library; the request may carry additional information such as headers. It then waits for the server to respond. If the server responds normally, the crawler receives a Response whose content is the requested page content, which may be HTML, a JSON string, binary data (such as images or video), or another type. HTML content can be parsed with regular expressions or a web page parsing library; JSON content can be converted directly into a JSON object and parsed; binary data can be saved or processed further. The crawled information can be saved as text, stored in a database, or saved as a file in a specific format. In the second case, information is created manually: in response to a storage request for the information sent by the manual review platform, the information is obtained and stored in the database.
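The response handling described above can be sketched as follows. This is a minimal illustration, not the crawler used in the embodiment: the function name `parse_response`, the helper class `TitleParser`, and the choice to dispatch on the Content-Type string are all assumptions made for the example, and the network request itself is left out.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_response(content_type, body):
    """Dispatch a fetched response body (bytes) by its content type."""
    if content_type.startswith("text/html"):
        # HTML: parse with a page parsing library.
        parser = TitleParser()
        parser.feed(body.decode("utf-8"))
        return {"kind": "html", "title": parser.title.strip()}
    if content_type.startswith("application/json"):
        # JSON: convert directly into an object.
        return {"kind": "json", "data": json.loads(body.decode("utf-8"))}
    # Anything else is treated as binary and kept for saving or later processing.
    return {"kind": "binary", "size": len(body)}
```

A real crawler would pass the parsed result on to the review and storage steps rather than returning a summary dictionary.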
For example, after information is acquired and before it is stored, each piece of information may be reviewed against preset review rules, which include at least sensitive-word matching and filtering-rule verification. If the information hits neither a sensitive word nor a filtering rule, it is stored in the database. If it hits a sensitive word and/or a filtering rule, a first prompt message indicating that the review failed may be generated, and the information and the first prompt message are sent to the manual review platform so that a user can decide whether to store the information.
The sensitive-word lexicon and the filtering-rule lexicon are built in advance. When information is to be stored, text matching is performed on fields such as its title, source, and body. If a sensitive word or a filter word is hit, the information is automatically judged to have failed review, and an operator must review it manually before deciding whether to store it. For example, in response to a maintenance instruction for information that has already been stored, the information may be placed under the stock information list of the corresponding stock.
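The automatic review described above amounts to a lexicon lookup over a few text fields. The sketch below assumes plain substring matching and hypothetical field names (`title`, `source`, `body`); the text does not specify the matching method or the record layout.

```python
def review(article, sensitive_words, filter_words):
    """Return (passed, hits) for one article.

    `article` is a dict with optional "title", "source" and "body" strings.
    The article passes only if no sensitive word and no filter word is hit.
    """
    text = " ".join(article.get(field, "") for field in ("title", "source", "body"))
    lexicon = set(sensitive_words) | set(filter_words)
    hits = sorted(word for word in lexicon if word in text)
    # An empty hit list means the article can be stored automatically;
    # otherwise it would be forwarded to manual review together with the hits.
    return (not hits, hits)
```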
In some embodiments, extracting the entity information of each piece of information in the information set includes:
obtaining the information text data of each piece of information in the information set;
processing the information text data of each piece of information with an entity extraction model to obtain the entity information of each piece of information, wherein the entity extraction model is used to extract preset entities from the information text data.
For example, the entity extraction model exposes a callable interface, so the entity information of each piece of information in the information set can be extracted by calling the model directly through that interface.
For example, the core of information is the reporting of news events, and a news event describes something that happened to one or more entities (such as companies, people, countries, or organizations). To associate information with stocks accurately, the entity extraction model must accurately extract the preset entities that appear in the information. The preset entities include, but are not limited to, companies, person names, product names, businesses, industries, countries, and organizations.
For example, suppose information a contains the text "Li, founder and CEO of A Automobile Company, officially announced that the XX model car will be released on X month X day". The entity information extracted from information a by the entity extraction model may include the following entities: the company name "A Automobile Company", the person name "Li", and the product name "XX model car".
Based on the extracted information, the system then determines the degree of association between each of these entities and every stock, and the core algorithm combines the entity-to-stock associations of all the entities to produce the final association result. The entity extraction module is therefore the first step in the overall stock association architecture.
For example, data annotation for the entity extraction model mainly targets the entity types of interest in the financial field, such as companies, person names, product names, businesses, industries, countries, and organizations.
Once the annotated data has been collected, an initial entity extraction model can be built from the pre-trained natural language processing model Bert and a global pointer module. The initial model is then fine-tuned on the data annotated with entity types to obtain the final entity extraction model.
The Bert model consists of an embedding layer and 12 Transformer layers and has about 110 million parameters in total, which makes it a very large model. In the embodiments of the present application, the Bert model serves as the text_encoder that extracts features from the input information text data. The input text first passes through Bert's tokenizer to produce a sequence of tokens of length L. Each token is then mapped to a word id according to the vocab mapping, giving an input tensor of shape [1,L], which is fed into the Bert model. With Bert acting as the encoder, the [1,L] tensor passes through the embedding layer and yields a tensor of shape [L,D], denoted R (D=768 for this Bert model). R is then fed into the global pointer layer, which outputs a tensor of shape [n_labels,L,L], where n_labels is the total number of entity categories. For example, for an entity extraction model that simultaneously extracts companies, person names, product names, businesses, industries, countries, and organizations, n_labels=7.
The global pointer layer uses the rich semantic information extracted by the text_encoder to indicate the head and tail of every entity in a single pass through a square pointer matrix, so that each entity can be located quickly in the original text and extracted directly. In this embodiment, a simplified multi-head attention (Multi-Head Attention) module can implement this function. A standard Multi-Head Attention module performs matrix computations with three matrices, Q (query), K (key), and V (value), and then computes Scaled Dot-Product Attention. Here, only the Q and K matrices (both of shape [D,d]) are used, together with the [L,D] tensor R obtained above, to project the [L,D] input down into an [L,d] feature space (d=64, typically d<<D). The projected results are denoted q and k, and the relevant formulas are as follows:
$$q = R \cdot Q, \qquad k = R \cdot K$$
Here, $S_\alpha(i,j)$ denotes the pointer matrix of the α-th entity class, with shape [L,L]. With n_labels classes, one such pointer matrix is computed per entity class, so the output of the whole global pointer layer is a tensor of shape [n_labels,L,L]. Note that the rows of $S_\alpha$ represent entity head positions and the columns represent entity tail positions. Although $S_\alpha$ is a square matrix, only its upper triangular part is meaningful, and the lower triangular outputs are simply ignored.
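The text above gives the projections q and k but not the expression for the pointer matrix itself. The sketch below assumes the usual dot-product form, in which entry (i, j) of the matrix for one class is the inner product of row i of q and row j of k, with one pair of projection matrices per entity class. Both the formula and the function names are assumptions for illustration; no scaling or position encoding is applied, and plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pointer_scores(R, Q, K):
    """Score every (head i, tail j) pair for one entity class.

    R has shape [L, D]; Q and K have shape [D, d].
    Returns an [L, L] matrix whose entry (i, j) is q_i . k_j (assumed form).
    """
    q = matmul(R, Q)                      # [L, d]
    k = matmul(R, K)                      # [L, d]
    k_t = [list(col) for col in zip(*k)]  # [d, L]
    return matmul(q, k_t)                 # [L, L]
```

Calling `pointer_scores` once per entity class and stacking the results would give the [n_labels,L,L] output described above.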
The classification output layer extracts entities from the pointer matrices. Values greater than 0 in the [n_labels,L,L] output are treated as activated entity head/tail pairs, so this layer converts the logits output by the model into 0/1 binary matrices in which activated head/tail positions are set to 1 and all others to 0.
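As a rough illustration of this decoding step, the sketch below reads entity spans out of the pointer matrices by thresholding at 0 and scanning only the upper triangle. The function name `decode_spans`, the list-of-lists representation, and joining tokens with spaces are assumptions made for the example.

```python
def decode_spans(logits, labels, tokens):
    """Turn [n_labels, L, L] logits into (label, head, tail, text) tuples."""
    spans = []
    for label, matrix in zip(labels, logits):
        for head, row in enumerate(matrix):
            # Only tail >= head is meaningful (upper triangle).
            for tail in range(head, len(row)):
                if row[tail] > 0:  # activated head/tail pair
                    text = " ".join(tokens[head:tail + 1])
                    spans.append((label, head, tail, text))
    return spans
```

Positive values in the lower triangle are never visited, which matches the statement that those outputs are ignored.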
For example, in the training phase, the information text data of the collected sample information is annotated manually: entity types such as companies, person names, product names, businesses, industries, countries, and organizations are labeled in the text to produce annotated data. The annotated data is then fed into a preset algorithm model for training; for example, the annotated data is used together with the Bert model to train the multiple Transformer layers, the global pointer layer, and the classification output layer, yielding the entity extraction model. In the application phase, the information text data of each piece of information in the information set is fed into the entity extraction model to extract the entity information of each piece of information.
Step 120, determining global statistical relationship information according to the stock set, the information set, and the entity information of each piece of information in the information set.
For example, step 120 can be implemented on a Spark computing task system. Spark is a parallel computing framework for big data based on in-memory computation. In-memory computation improves the real-time performance of data processing in big data environments while preserving high fault tolerance and high scalability, and Spark can be deployed as a cluster on a large number of inexpensive machines, which increases parallel computing capacity.
The Spark computing task system applies transformations and aggregations, following the MapReduce approach, to the stocks, the massive volume of stock information, and the tens of millions of entities extracted from that information. The result is global statistical relationship information between objects of different categories, which can characterize relationships between stocks and entities, relationships between entities, and so on.
MapReduce is a programming model for parallel computation over large-scale data sets (larger than 1 TB).
The global statistical relationship information includes a first co-occurrence relationship between different entities and different stocks, a second co-occurrence relationship between different entities, and a first degree of association between different entities and different stocks. For example, it may include the stock-entity first co-occurrence relationship, the entity-entity second co-occurrence relationship, the global IDF (Inverse Document Frequency) value of each entity, the stock-entity MF (Mention Frequency, that is, entity frequency) value, the MFIDF (entity frequency-inverse document frequency) value, and so on.
For example, in this embodiment every entity is denoted by the word "mention" rather than "entity".
In some embodiments, step 120 may be implemented through steps 121 to 123 (not shown in the figure), as follows:
Step 121, determining the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set according to the stock set, the information set, and the entity information of each piece of information in the information set.
In some embodiments, determining the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set according to the stock set, the information set, and the entity information of each piece of information in the information set includes:
determining, according to the stock information list corresponding to each stock in the stock set and the information set, the stocks that have an initial association relationship with each piece of information in the information set;
determining, according to the stocks having an initial association relationship with each piece of information in the information set and the entity information of each piece of information in the information set, the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set.
For example, for the stock-entity first co-occurrence relationship, take information a in the information set. Suppose the stock information lists and the information set show that information a is associated with n stocks, and m entities can be extracted from information a. Information a then yields n×m "stock-entity" pairs. Each pairing of a stock with an entity (the pairing meaning that the two appear together) counts as one co-occurrence, so the frequency of each "stock-entity" pair is incremented by 1. Accumulating over the information set gives the total global co-occurrence count of a given stock and a given entity, and traversing all "stock-entity" pairs gives the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set. The first co-occurrence relationship characterizes how strongly an entity and a stock are related and helps users see which entities are most relevant to a given stock. For example, the stock 00700.HK frequently co-occurs with the entity "Honor of Kings" (王者荣耀), which indicates that the two are highly related.
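The counting procedure above can be sketched in a few lines. This is a single-machine illustration under assumed names (`stock_entity_cooccurrence`, the `stocks` and `entities` keys); the embodiment describes running the equivalent aggregation as a Spark job.

```python
from collections import Counter


def stock_entity_cooccurrence(articles):
    """Count, per (stock, entity) pair, the articles in which both appear.

    Each article is a dict with "stocks" (the initially associated stocks)
    and "entities" (the mentions extracted from its text).
    """
    counts = Counter()
    for article in articles:
        # n stocks and m entities yield n x m pairs, each counted once per article.
        for stock in set(article["stocks"]):
            for entity in set(article["entities"]):
                counts[(stock, entity)] += 1
    return counts
```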
Step 122, determining the second co-occurrence relationship between the entities in each piece of information according to the entity information of each piece of information in the information set.
For example, the entity-entity second co-occurrence relationship is obtained in a similar way. Take information a in the information set and suppose m entities can be extracted from it. Pairing the different entities in information a with one another forms m×m "entity-entity" pairs, and each pairing of two entities (the pairing meaning that the two appear together) counts as one co-occurrence. Accumulating over the information set gives the total global co-occurrence count of one entity with another, and traversing all "entity-entity" pairs gives the second co-occurrence relationship between the entities in each piece of information. The second co-occurrence relationship characterizes how strongly entities are related to one another. For example, the entity "WeChat" frequently co-occurs with the entity "official account", which indicates that the two are highly related.
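A comparable sketch for the entity-entity counts follows. It makes one simplifying assumption that differs from the m×m description: each unordered pair of distinct entities is counted once per article, stored with its two names in sorted order. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def entity_entity_cooccurrence(articles_entities):
    """Count, per unordered entity pair, the articles in which both appear.

    `articles_entities` is a list with one list of extracted entities per article.
    """
    counts = Counter()
    for entities in articles_entities:
        # Deduplicate and sort so that (a, b) and (b, a) share one key.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts
```

Counting ordered pairs instead would only require replacing `combinations` with `itertools.permutations`.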
Step 123, determining the first degree of association between each stock in the stock set and each entity in each piece of information according to the stock set, the information set, and the entity information of each piece of information in the information set.
In some embodiments, determining the first degree of association between each stock in the stock set and each entity in each piece of information according to the stock set, the information set, and the entity information of each piece of information in the information set includes:
determining, according to the stock information list corresponding to each stock in the stock set, the information set, and the entity information of each piece of information in the information set, the number of pieces of information in which the i-th entity in each piece of information co-occurs with the j-th stock in the stock set, the total number of pieces of information corresponding to the j-th stock, the total number of pieces of information in the information set, and the total number of pieces of information in the information set in which the i-th entity appears;
determining the entity frequency of the j-th stock relative to the i-th entity according to the number of pieces of information in which the i-th entity co-occurs with the j-th stock in the stock set and the total number of pieces of information corresponding to the j-th stock;
determining the inverse document frequency of the i-th entity according to the total number of pieces of information in the information set and the total number of pieces of information in the information set in which the i-th entity appears;
determining the first degree of association between the j-th stock and the i-th entity according to the product of the entity frequency and the inverse document frequency;
traversing the entities of each piece of information in the information set to determine the first degree of association between each stock in the stock set and each entity in each piece of information.
For example, an MFIDF algorithm modeled on the classic TFIDF algorithm can be used to find the list of all entity mentions that are "distinctive and relevant" to a target stock.
For example, the entity frequency $\mathrm{MF}_{ij}$ of the j-th stock relative to the i-th entity is determined from the number of pieces of information in which the i-th entity co-occurs with the j-th stock in the stock set and the total number of pieces of information corresponding to the j-th stock, and can be expressed as formula (1):
$$\mathrm{MF}_{ij} = \frac{\text{number of pieces of information in which entity } i \text{ co-occurs with stock } j}{\text{total number of pieces of information corresponding to stock } j} \tag{1}$$
For example, the inverse document frequency $\mathrm{IDF}_{i}$ of the i-th entity is determined from the total number of pieces of information in the information set and the total number of pieces of information in the information set in which the i-th entity appears, and can be expressed as formula (2):
$$\mathrm{IDF}_{i} = \log\frac{\text{total number of pieces of information in the information set}}{\text{number of pieces of information in which entity } i \text{ appears} + 1} \tag{2}$$
In the denominator, 1 is added to the number of pieces of information in which entity i appears so that the denominator is never 0 (the case where no piece of information contains the entity); log denotes taking the logarithm of the resulting value.
For example, the first degree of association $\mathrm{MFIDF}_{ij}$ between the j-th stock and the i-th entity is determined from the product of the entity frequency $\mathrm{MF}_{ij}$ and the inverse document frequency $\mathrm{IDF}_{i}$, and can be expressed as formula (3):
$$\mathrm{MFIDF}_{ij} = \mathrm{MF}_{ij} \times \mathrm{IDF}_{i} \tag{3}$$
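Formulas (1) to (3) combine into a short computation. The sketch below takes the four article counts named in the text as arguments; the function name `mfidf` and the use of the natural logarithm are assumptions, since the base of the logarithm is not stated.

```python
import math


def mfidf(cooccur_articles, stock_articles, total_articles, entity_articles):
    """Compute MFIDF for one (entity i, stock j) pair.

    cooccur_articles: articles in which entity i and stock j co-occur
    stock_articles:   articles associated with stock j (must be > 0)
    total_articles:   articles in the whole information set
    entity_articles:  articles in which entity i appears
    """
    mf = cooccur_articles / stock_articles                  # formula (1)
    idf = math.log(total_articles / (entity_articles + 1))  # formula (2), natural log assumed
    return mf * idf                                         # formula (3)
```

An entity that appears in half of a stock's articles but in few articles overall scores high, while an entity that never co-occurs with the stock scores 0.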
For example, the Spark computing task system processes the stock set, the information set, and the entity information of each piece of information in the information set to produce a Spark task result, which may include the global statistical relationship information. The Spark task result is then written directly to Hive tables. For example, one Hive table is generated for all first co-occurrence relationships in the information set (containing the stock-entity first co-occurrence relationship for each piece of information), and one Hive table is generated for all second co-occurrence relationships in the information set (containing the entity-entity second co-occurrence relationship for each piece of information). The first degrees of association for the stock set (containing the first degree of association between each stock and each entity in each piece of information) may correspond to several Hive tables: one storing MF values, another storing IDF values, and a third storing MFIDF values. Because queries against Hive tables are slow, the Hive tables can be synchronized to MySQL tables so that MySQL indexes can be used for fast lookups. The Hive table storing the Spark task result, or the corresponding MySQL table, may be denoted the stock_mention table.
Step 130, determining, according to the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set, where an associated stock is a stock related to the corresponding information.
In some embodiments, determining the associated stocks corresponding to each piece of information in the information set according to the global statistical relationship information includes:
determining, according to the first co-occurrence relationship and the second co-occurrence relationship, a co-occurring entity list corresponding to each stock in the stock set, where the entities in the co-occurring entity list are entities that have co-occurred with the corresponding stock;
sorting the entities in the co-occurring entity list according to the first degree of association, and determining the top N entities of the sorted list as the candidate entities corresponding to each stock in the stock set;
obtaining the association link identifier between each stock and each of its candidate entities;
when the association link identifiers between a target stock and its candidate entities satisfy a preset condition, and each candidate entity corresponding to the target stock belongs to target information, determining the target stock as an associated stock of the target information;
wherein the target stock is any stock in the stock set, and the target information is any piece of information in the information set.
For example, the results stored in the stock_mention table can be used to determine the associated stocks corresponding to each piece of information in the information set. The co-occurring entity list for each stock in the stock set can be determined from the first and second co-occurrence relationships in the stock_mention table: given any stock A, all entities that have co-occurred with stock A are returned to form the co-occurring entity list of stock A. Sorting this list in descending order of the MFIDF value, which represents the first degree of association, and taking the top N gives a preliminary list of the entities most relevant to stock A (this list contains the candidate entities corresponding to stock A). Traversing every stock in the stock set determines the candidate entities corresponding to each stock.
However, the massive volume of information contains a certain amount of noise, and MFIDF, like TFIDF, is not highly precise, so the top-N entities of stock A cannot be used externally without review. A manual review platform can therefore be added to check the most relevant entities of each stock (for example, more than 20,000 stocks) one by one. Specifically, the candidate entities corresponding to each stock in the stock set are sent to the manual review platform. In response to review operations entered on the platform, an association link identifier between each stock and each of its candidate entities is generated and returned to the server, so that the server obtains these identifiers. When the association link identifiers between a target stock and its candidate entities satisfy the preset condition, and each candidate entity corresponding to the target stock belongs to the target information, the target stock is determined as an associated stock of the target information. For example, the association link between a stock and an entity is closed by default, that is, the association link identifier active=0; if the stock needs to be associated with the entity, the link is activated manually by setting active to 1.
After the big data statistics, the stored stock-entity relationships cover the vast majority of cases. Inevitably, however, some association links are not covered by the existing data, and operators can add association links between stocks and entities manually. The embodiments of the present application can provide the ability to add one or more association links between stocks and entities; adding, deleting, activating, and deactivating links can be done conveniently on the manual review platform, and modified links are reflected promptly in the association results of the online service. The manual operation table generated by operations on the manual review platform is built on top of the base data of the stock_mention table, with manual operation records for stock-entity pairs added. It is denoted the ops_stock_mention table and contains the association link identifier between each stock and each of its candidate entities.
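The candidate selection step (sort by MFIDF in descending order, keep the top N) can be sketched as below. The nested-dictionary input and the function name are assumptions for illustration; in the embodiment the scores would come from the stock_mention table. Ties are broken alphabetically here only to make the output deterministic.

```python
def top_n_candidates(mfidf_scores, n):
    """Pick the n highest-scoring entities for every stock.

    `mfidf_scores` maps stock -> {entity: MFIDF value}.
    Returns stock -> list of entity names, best first.
    """
    return {
        stock: [
            entity
            for entity, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        ]
        for stock, scores in mfidf_scores.items()
    }
```

The resulting lists are what would be sent to the manual review platform, where each link starts with active=0.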
In the embodiments of the present application, the core of associating information with stocks lies in determining the degree of association between an entity and each stock in the stock set. That is, a mechanism or algorithm is needed under which the set of entities extracted from a piece of information has the highest degree of association with the right stock; for example, the entity set "A Automobile Company", "Li", and "XX model car" should have the highest degree of association with stock b and a low degree of association with all other stocks. In this way the entities extracted from the information are associated with stocks, which in turn associates the information with stocks. For example, the manually activated entity-stock links obtained in step 130 can be used for direct association: all entities appearing in a piece of information are looked up together in the ops_stock_mention table, the association links with active=1 that correspond to those entities are retrieved, and the target stocks of those links are returned directly as associated stocks. Specifically, when the association link identifier between the target stock and each of its candidate entities is active=1, the association link identifiers between the target stock and its candidate entities are determined to satisfy the preset condition. When the association link identifiers between the target stock and its candidate entities satisfy the preset condition, and each candidate entity corresponding to the target stock belongs to the target information, the target stock is determined as an associated stock of the target information.
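The direct lookup described above can be illustrated with an in-memory stand-in for the ops_stock_mention table. The row layout `(stock, entity, active)` and the function name are assumptions, and the sketch follows the simpler reading in which a stock is returned as soon as any entity of the article has an active link to it. The stricter condition, in which every candidate entity of the stock must be active and present in the article, would need an additional check.

```python
def associated_stocks(article_entities, ops_stock_mention):
    """Return the stocks that have an active link to any entity in the article.

    `ops_stock_mention` is a list of (stock, entity, active) rows, where
    active is 1 for a manually activated link and 0 otherwise.
    """
    entities = set(article_entities)
    return sorted({
        stock
        for stock, entity, active in ops_stock_mention
        if active == 1 and entity in entities
    })
```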
As shown in Figure 2, an information set is obtained from the information database. Entity extraction is then performed on the information set to extract the entity information of each piece of information, and the entity information is stored in Table 1. Next, a scheduled Spark job of the Spark computing task system processes the data to obtain the Spark task results, which are written to Table 2. Table 2 is a Hive table and Table 3 is a MySQL table. Because Hive does not support indexes, queries against it are slow and unsuitable for online use, so the Spark task results in Table 2 need to be synchronized to MySQL for storage. Tables 2 and 3 therefore hold the same data and differ only in storage medium: one is stored in Hive and the other in MySQL. Spark task results can be written to Hive natively but are not easy to write to MySQL, which is why this synchronization from Table 2 to Table 3 is required. The "target stocks stored in the database" step in Figure 2 means that at the initial launch only the Spark task results for popular stocks (the target stocks) are stored in Table 3 (the MySQL table); in subsequent launches, the full data for all stocks can be synchronized to Table 3.
The operation table stores the association results that are retained after manual review of the data in Table 3. Because Table 3 contains a very large number of rows, on the order of tens of millions, it is not suitable for real-time synchronization. Instead, the data in Table 3 can be processed on a schedule, and the results updated and merged into the operation table.
All fields in the operation table come from Table 3, except that the operation table adds an active field. Between one synchronization and the next update, the following may occur: 1) operations staff may have modified the operation table, for example by setting the active field of some rows to 1 or 0, or by adding rows where manual review determined that an entity (mention) needs to be associated with a stock; 2) Table 3 may also have gained new rows or increased co-occurrence counts during this period; 3) at synchronization time, all updated data in Table 3 is merged into the operation table. For example, the display interface of the entity management back end can also present the operation view of different stocks in real time; operations staff modify or add rows of the operation table through the entity management back end, and the data corresponding to those modifications or additions is written to the operation table.
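The scheduled merge of Table 3 into the operation table can be sketched as follows. This is only an illustration under stated assumptions: rows are keyed by a (stock, entity) pair, the statistics field is called `co_count`, and rows that are new in Table 3 start with active=0. None of these details are specified in the application.

```python
def merge_into_operation_table(operation_rows, table3_rows):
    """Merge Table 3 updates into the operation table, keeping manual `active` flags.

    Both arguments map (stock, entity) -> dict of fields. The key layout and
    the `co_count` field name are assumptions made for this sketch.
    """
    # Copy so the caller's operation table is left untouched.
    merged = {key: dict(row) for key, row in operation_rows.items()}
    for key, row in table3_rows.items():
        if key in merged:
            active = merged[key].get("active", 0)
            merged[key].update(row)          # refresh statistics from Table 3
            merged[key]["active"] = active   # keep the reviewer's decision
        else:
            # New rows from Table 3 start inactive until reviewed (assumption).
            merged[key] = {**row, "active": 0}
    return merged
```

Rows that exist only in the operation table, such as links added by hand, pass through unchanged, which matches case 1) above.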
For example, after the associated stocks corresponding to each piece of information in the information set are determined from the results stored in the operation table, the associated stocks corresponding to different pieces of information can be presented on the user interface of the client.
In some embodiments, the method further includes displaying the associated stocks. For example, if there are multiple associated stocks, they can be displayed on the user interface of the client in a preset order. The preset order may include, for example, ordering the associated stocks by release time, by the number of user positions held in each stock, or by user click-through rate.
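A preset display order of this kind could be implemented as a lookup of sort keys, as in the sketch below. The field names, the `order_for_display` function, and the choice of descending order are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DisplayStock:
    code: str
    release_time: int    # e.g. a Unix timestamp (assumed representation)
    holders: int         # number of user positions in this stock
    click_rate: float    # user click-through rate


# One sort key per preset order named in the text.
SORT_KEYS = {
    "release_time": lambda s: s.release_time,
    "holders": lambda s: s.holders,
    "click_rate": lambda s: s.click_rate,
}


def order_for_display(stocks, preset="holders"):
    """Return the associated stocks in the chosen preset order, largest first."""
    return sorted(stocks, key=SORT_KEYS[preset], reverse=True)
```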
For example, once the algorithm logic of the information association method of the embodiments of the present application is complete, a service interface needs to be provided to developers on the business side, to accept the information data they input and return the list of associated stocks together with other auxiliary information. For example, the interface protocol of the service can be agreed with the business-side developers as follows:
1) Two interface methods are provided: a. Obtain the association result by article ID (doc_id): the article ID and article type are passed in, and the corresponding article text is looked up in the middle platform from that information. The article types supported online may include, but are not limited to, information, financial column news, and news flashes. b. Obtain the association result by text (text): only the information text string needs to be passed in to query the corresponding information data.
2) A doc_lang language parameter is added to support information in two languages, Chinese and English.
3) When the list of associated stocks is returned, each associated stock in the list needs to include the stock ID, the stock code, the stock's Chinese short name, and a list of entities related to the stock; each entity also needs to include results such as the stock-entity degree of association.
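The interface protocol above could be modeled with plain data classes, as in the following sketch. Only `doc_id`, `text`, and `doc_lang` are named in the description; the class names, the remaining field names, and the language codes "zh" and "en" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AssociationRequest:
    doc_lang: str = "zh"             # "zh" or "en" (value spellings assumed)
    doc_id: Optional[str] = None     # method a: look up the text by article ID
    doc_type: Optional[str] = None   # article type, required together with doc_id
    text: Optional[str] = None       # method b: raw information text

    def validate(self) -> str:
        """Return which interface method the request uses, or raise ValueError."""
        if self.doc_lang not in ("zh", "en"):
            raise ValueError("doc_lang must be 'zh' or 'en'")
        if self.doc_id is not None and self.doc_type is not None:
            return "by_doc_id"
        if self.text:
            return "by_text"
        raise ValueError("need doc_id with doc_type, or text")


@dataclass
class RelatedEntity:
    name: str
    relevance: float                 # stock-entity degree of association


@dataclass
class StockResult:
    stock_id: int
    stock_code: str
    stock_name_cn: str               # Chinese short name of the stock
    entities: List[RelatedEntity] = field(default_factory=list)
```

A response would then be a list of `StockResult` objects, each carrying the entities that linked the stock to the submitted information.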
All of the above technical solutions can be combined in any manner to form optional embodiments of the present application, which are not described one by one here.
The information association method provided by the embodiments of the present application extracts the entity information of each piece of information in the information set, where the entity information includes at least one entity; determines global statistical relationship information based on the stock set, the information set, and the entity information of each piece of information in the information set; and then determines, based on the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set, where an associated stock is a stock related to the corresponding piece of information. This can improve the accuracy of association between information and stocks and the efficiency of information distribution. Compared with associating stocks by exact match on the stock name, the embodiments of the present application reduce the false associations and missed associations that exact name matching causes, and can determine the degree of association between stocks and information more accurately, thereby improving the accuracy of information-to-stock association.
To facilitate better implementation of the information association method of the embodiments of the present application, the embodiments of the present application further provide an information association apparatus. Referring to Figure 3, Figure 3 is a schematic structural diagram of the information association apparatus provided by an embodiment of the present application. The information association apparatus 200 may include:
an extraction unit 210, configured to extract entity information of each piece of information in the information set, where the entity information includes at least one entity;
a first determining unit 220, configured to determine global statistical relationship information based on the stock set, the information set, and the entity information of each piece of information in the information set;
a second determining unit 230, configured to determine, based on the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set, where an associated stock is a stock related to the corresponding piece of information.
In some embodiments, the first determining unit 220 is configured to:
determine, based on the stock set, the information set, and the entity information of each piece of information in the information set, a first co-occurrence relationship between each entity in each piece of information and each stock in the stock set;
determine, based on the entity information of each piece of information in the information set, a second co-occurrence relationship among the entities in each piece of information;
determine, based on the stock set, the information set, and the entity information of each piece of information in the information set, a first degree of association between each stock in the stock set and each entity in each piece of information.
In some embodiments, the extraction unit 210 is further configured to:
obtain the stock information list corresponding to each stock in the stock set, where the stock information list corresponding to each stock stores at least one piece of information that has an initial association relationship with that stock;
obtain the information set based on the stock information lists, where the information set contains all the information in all the stock information lists.
In some embodiments, when determining the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set based on the stock set, the information set, and the entity information of each piece of information in the information set, the first determining unit 220 is configured to:
determine, based on the stock information list corresponding to each stock in the stock set and on the information set, the stocks that have an initial association relationship with each piece of information in the information set;
determine, based on the stocks that have an initial association relationship with each piece of information in the information set and on the entity information of each piece of information in the information set, the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set.
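The two steps above can be sketched in Python as an inversion of the per-stock information lists followed by a count of (entity, stock) pairs. The data shapes and the function name `first_cooccurrence` are assumptions; the application does not say how the co-occurrence relationship is stored.

```python
from collections import Counter


def first_cooccurrence(stock_doc_lists, doc_entities):
    """Count, for each (entity, stock) pair, the documents in which they co-occur.

    stock_doc_lists: stock -> list of document IDs initially associated with it.
    doc_entities:    document ID -> set of entities extracted from that document.
    """
    # Step 1: invert the per-stock lists to get, for each document,
    # the stocks that have an initial association with it.
    doc_stocks = {}
    for stock, doc_ids in stock_doc_lists.items():
        for doc_id in doc_ids:
            doc_stocks.setdefault(doc_id, set()).add(stock)

    # Step 2: every entity of a document co-occurs with every stock of that document.
    counts = Counter()
    for doc_id, stocks in doc_stocks.items():
        for entity in doc_entities.get(doc_id, ()):
            for stock in stocks:
                counts[(entity, stock)] += 1
    return counts
```

The resulting counter doubles as the "number of co-occurring pieces of information" that the first degree of association uses.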
In some embodiments, when determining the first degree of association between each stock in the stock set and each entity in each piece of information based on the stock set, the information set, and the entity information of each piece of information in the information set, the first determining unit 220 is configured to:
determine, based on the stock information list corresponding to each stock in the stock set, the information set, and the entity information of each piece of information in the information set: the number of pieces of information in which the i-th entity of each piece of information co-occurs with the j-th stock in the stock set; the total number of pieces of information corresponding to the j-th stock; the total number of pieces of information in the information set; and the total number of pieces of information in the information set in which the i-th entity appears;
determine the entity frequency of the j-th stock relative to the i-th entity, based on the number of pieces of information in which the i-th entity co-occurs with the j-th stock and on the total number of pieces of information corresponding to the j-th stock;
determine the inverse document frequency of the i-th entity, based on the total number of pieces of information in the information set and on the total number of pieces of information in the information set in which the i-th entity appears;
determine the first degree of association between the j-th stock and the i-th entity from the product of the entity frequency and the inverse document frequency;
traverse the entities of each piece of information in the information set to determine the first degree of association between each stock in the stock set and each entity in each piece of information.
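The computation above follows the familiar TF-IDF pattern and can be sketched as below. The application names the four counts and the product but not the exact formulas, so the ratio used for the entity frequency and the natural-logarithm form of the inverse document frequency are assumptions, as is the function name.

```python
import math


def first_association(co_docs, stock_docs, total_docs, entity_docs):
    """Entity frequency times inverse document frequency for one (entity, stock) pair.

    co_docs:     documents in which entity i and stock j co-occur
    stock_docs:  documents associated with stock j
    total_docs:  documents in the whole information set
    entity_docs: documents in which entity i appears
    """
    if stock_docs == 0 or entity_docs == 0:
        return 0.0
    entity_frequency = co_docs / stock_docs
    # Assumed log form of the inverse document frequency.
    inverse_doc_frequency = math.log(total_docs / entity_docs)
    return entity_frequency * inverse_doc_frequency
```

With this form, an entity that appears in nearly every piece of information scores close to zero for every stock, while an entity concentrated in one stock's information list scores high for that stock.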
In some embodiments, the second determining unit 230 is configured to:
determine, based on the first co-occurrence relationship and the second co-occurrence relationship, the co-occurring entity list corresponding to each stock in the stock set, where the entities in the co-occurring entity list are entities that have co-occurred with the corresponding stock;
sort the entities in the co-occurring entity list by the first degree of association, and determine the top N entities of the sorted co-occurring entity list as the candidate entities corresponding to each stock in the stock set;
obtain the association link identifier between each stock and each of its corresponding candidate entities;
when the association link identifiers between a target stock and its corresponding candidate entities meet the preset condition, and the candidate entities corresponding to the target stock all belong to the target information, determine the target stock as an associated stock of the target information;
where the target stock is any stock in the stock set, and the target information is any piece of information in the information set.
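The candidate-selection step can be sketched as a per-stock ranking that keeps the top N entities. The input shapes, the function name `top_n_candidates`, and the tie-breaking by entity name are assumptions added so the example is deterministic.

```python
def top_n_candidates(cooccurring, scores, n):
    """For each stock, keep the N co-occurring entities with the highest first association.

    cooccurring: stock -> iterable of entities that have co-occurred with it
    scores:      (stock, entity) -> first degree of association
    """
    candidates = {}
    for stock, entities in cooccurring.items():
        ranked = sorted(
            set(entities),
            # Highest score first; ties broken by entity name (assumption).
            key=lambda e: (-scores.get((stock, e), 0.0), e),
        )
        candidates[stock] = ranked[:n]
    return candidates
```

The candidates returned here are the entities whose association link identifiers are then checked against the preset condition.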
In some embodiments, the extraction unit 210 is configured to:
obtain the information text data of each piece of information in the information set;
process the information text data of each piece of information with an entity extraction model to obtain the entity information of each piece of information, where the entity extraction model is used to extract preset entities from the information text data.
All of the above technical solutions can be combined in any manner to form optional embodiments of the present application, which are not described one by one here.
It should be understood that the apparatus embodiments and the method embodiments correspond to each other, and similar descriptions can be found in the method embodiments; to avoid repetition, they are not repeated here. Specifically, the information association apparatus shown in the figure can carry out the above information association method embodiments, and the foregoing and other operations and/or functions of the units in the information association apparatus respectively implement the corresponding processes of the above method embodiments. For brevity, they are not repeated here.
Optionally, the present application further provides a computer device including a memory and a processor, where the memory stores a computer program and the processor implements the steps of the above method embodiments when executing the computer program.
Figure 4 is a schematic structural diagram of the computer device provided by an embodiment of the present application. The computer device may be a terminal or a server. As shown in Figure 4, the computer device 300 may include a communication interface 301, a memory 302, a processor 303, and a communication bus 304. The communication interface 301, the memory 302, and the processor 303 communicate with one another through the communication bus 304. The communication interface 301 is used for data communication between the computer device 300 and external devices. The memory 302 can be used to store software programs and modules, and the processor 303 runs the software programs and modules stored in the memory 302, for example the software programs that perform the corresponding operations in the foregoing method embodiments.
Optionally, the processor 303 can call the software programs and modules stored in the memory 302 to perform the following operations:
extract entity information of each piece of information in the information set, where the entity information includes at least one entity; determine global statistical relationship information based on the stock set, the information set, and the entity information of each piece of information in the information set; and determine, based on the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set, where an associated stock is a stock related to the corresponding piece of information.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by instructions, or by instructions controlling the relevant hardware. The instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer-readable storage medium storing a plurality of computer programs that can be loaded by a processor to execute the steps of any information association method provided by the embodiments of the present application. For the specific implementation of each of the above operations, refer to the foregoing embodiments; details are not repeated here.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Because the computer programs stored in the storage medium can execute the steps of any information association method provided by the embodiments of the present application, they can achieve the beneficial effects of any such method. See the foregoing embodiments for details, which are not repeated here.
An embodiment of the present application further provides a computer program product that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to carry out the corresponding process of any information association method in the embodiments of the present application. For brevity, details are not repeated here.
An embodiment of the present application further provides a computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to carry out the corresponding process of any information association method in the embodiments of the present application. For brevity, details are not repeated here.
The information association method, apparatus, device, and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, a person skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

  1. An information association method, characterized in that the method comprises:
    extracting entity information of each piece of information in an information set, wherein the entity information includes at least one entity;
    determining global statistical relationship information based on a stock set, the information set, and the entity information of each piece of information in the information set;
    determining, based on the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set, wherein an associated stock is a stock related to the corresponding piece of information.
  2. The information association method according to claim 1, characterized in that determining the global statistical relationship information based on the stock set, the information set, and the entity information of each piece of information in the information set comprises:
    determining, based on the stock set, the information set, and the entity information of each piece of information in the information set, a first co-occurrence relationship between each entity in each piece of information and each stock in the stock set;
    determining, based on the entity information of each piece of information in the information set, a second co-occurrence relationship among the entities in each piece of information;
    determining, based on the stock set, the information set, and the entity information of each piece of information in the information set, a first degree of association between each stock in the stock set and each entity in each piece of information.
  3. The information association method according to claim 2, characterized in that, before extracting the entity information of each piece of information in the information set, the method further comprises:
    obtaining the stock information list corresponding to each stock in the stock set, wherein the stock information list corresponding to each stock stores at least one piece of information that has an initial association relationship with that stock;
    obtaining the information set based on the stock information lists, wherein the information set contains all the information in all the stock information lists.
  4. The information association method according to claim 3, characterized in that determining the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set based on the stock set, the information set, and the entity information of each piece of information in the information set comprises:
    determining, based on the stock information list corresponding to each stock in the stock set and on the information set, the stocks that have an initial association relationship with each piece of information in the information set;
    determining, based on the stocks that have an initial association relationship with each piece of information in the information set and on the entity information of each piece of information in the information set, the first co-occurrence relationship between each entity in each piece of information and each stock in the stock set.
  5. The information association method according to claim 3, characterized in that determining the first degree of association between each stock in the stock set and each entity in each piece of information based on the stock set, the information set, and the entity information of each piece of information in the information set comprises:
    determining, based on the stock information list corresponding to each stock in the stock set, the information set, and the entity information of each piece of information in the information set: the number of pieces of information in which the i-th entity of each piece of information co-occurs with the j-th stock in the stock set; the total number of pieces of information corresponding to the j-th stock; the total number of pieces of information in the information set; and the total number of pieces of information in the information set in which the i-th entity appears;
    determining the entity frequency of the j-th stock relative to the i-th entity, based on the number of pieces of information in which the i-th entity co-occurs with the j-th stock and on the total number of pieces of information corresponding to the j-th stock;
    determining the inverse document frequency of the i-th entity, based on the total number of pieces of information in the information set and on the total number of pieces of information in the information set in which the i-th entity appears;
    determining the first degree of association between the j-th stock and the i-th entity from the product of the entity frequency and the inverse document frequency;
    traversing the entities of each piece of information in the information set to determine the first degree of association between each stock in the stock set and each entity in each piece of information.
  6. The information association method according to any one of claims 2 to 5, characterized in that determining, based on the global statistical relationship information, the associated stocks corresponding to each piece of information in the information set comprises:
    determining, based on the first co-occurrence relationship and the second co-occurrence relationship, the co-occurring entity list corresponding to each stock in the stock set, wherein the entities in the co-occurring entity list are entities that have co-occurred with the corresponding stock;
    sorting the entities in the co-occurring entity list by the first degree of association, and determining the top N entities of the sorted co-occurring entity list as the candidate entities corresponding to each stock in the stock set;
    obtaining the association link identifier between each stock and each of its corresponding candidate entities;
    when the association link identifiers between a target stock and its corresponding candidate entities meet a preset condition, and the candidate entities corresponding to the target stock all belong to target information, determining the target stock as an associated stock of the target information;
    wherein the target stock is any stock in the stock set, and the target information is any piece of information in the information set.
  7. The information association method according to claim 1, characterized in that extracting the entity information of each piece of information in the information set comprises:
    obtaining information text data of each piece of information in the information set; and
    processing the information text data of each piece of information based on an entity extraction model to obtain the entity information of each piece of information, wherein the entity extraction model is used to extract preset entities from the information text data.
  8. An information association apparatus, characterized in that the apparatus comprises:
    an extraction unit, configured to extract entity information of each piece of information in an information set, wherein the entity information includes at least one entity;
    a first determining unit, configured to determine global statistical relationship information according to an individual stock set, the information set, and the entity information of each piece of information in the information set; and
    a second determining unit, configured to determine, according to the global statistical relationship information, the associated individual stocks corresponding to each piece of information in the information set, wherein the associated individual stocks are individual stocks related to the corresponding piece of information.
  9. A computer device, characterized in that the computer device comprises a processor and a memory, the memory stores a computer program, and the processor, by calling the computer program stored in the memory, executes the information association method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program is adapted to be loaded by a processor to execute the information association method according to any one of claims 1 to 7.
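
By way of non-limiting illustration, the computation recited in claims 5 and 6 may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The entity frequency is assumed to be the co-occurrence count of a stock and an entity divided by that stock's total co-occurrence count, and the inverse document frequency is assumed to be the logarithm of the number of stocks divided by the number of stocks the entity has co-occurred with, by analogy with TF-IDF. The predicate `link_ok` is a hypothetical stand-in for the check of the association link identifier against the preset condition, which the claims do not specify. All function names and sample data are invented for illustration.

```python
import math


def first_association_degrees(cooccur):
    """Map {stock: {entity: co-occurrence count}} to {stock: {entity: degree}}.

    Degree = entity frequency * inverse document frequency (both assumed forms).
    """
    n_stocks = len(cooccur)
    # Number of distinct stocks each entity has co-occurred with.
    stocks_per_entity = {}
    for entities in cooccur.values():
        for entity in entities:
            stocks_per_entity[entity] = stocks_per_entity.get(entity, 0) + 1

    degrees = {}
    for stock, entities in cooccur.items():
        total = sum(entities.values())
        degrees[stock] = {
            entity: (count / total) * math.log(n_stocks / stocks_per_entity[entity])
            for entity, count in entities.items()
        }
    return degrees


def candidate_entities(degrees, top_n):
    """Rank each stock's co-occurrence entity list by degree and keep the top N."""
    return {
        stock: [
            entity
            for entity, _ in sorted(
                scores.items(), key=lambda kv: kv[1], reverse=True
            )[:top_n]
        ]
        for stock, scores in degrees.items()
    }


def associated_stocks(news_entities, candidates, link_ok):
    """Return the stocks whose candidate entities all appear in the target
    piece of information and all pass the (hypothetical) link check."""
    return [
        stock
        for stock, cands in candidates.items()
        if cands and all(e in news_entities and link_ok(stock, e) for e in cands)
    ]


# Invented sample data: co-occurrence counts between stocks and entities.
cooccur = {
    "STOCK_A": {"phone": 3, "ceo_a": 1, "chip": 1},
    "STOCK_B": {"car": 3, "ceo_b": 1, "chip": 1},
}
degrees = first_association_degrees(cooccur)
candidates = candidate_entities(degrees, top_n=2)

# Entities extracted from one target piece of information (also invented).
target_entities = {"phone", "ceo_a", "launch"}
result = associated_stocks(target_entities, candidates, link_ok=lambda s, e: True)
print(candidates)
print(result)
```

In this sketch the entity "chip" co-occurs with every stock, so its assumed inverse document frequency is zero and it never ranks among the candidate entities. This mirrors the purpose of weighting by inverse document frequency: entities shared by many stocks carry little discriminating power.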
PCT/CN2023/112709 2022-12-21 2023-08-11 Information association method and apparatus, device, and storage medium WO2024131091A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211660198.7A CN115984004A (en) 2022-12-21 2022-12-21 Information association method, device, equipment and storage medium
CN202211660198.7 2022-12-21

Publications (1)

Publication Number Publication Date
WO2024131091A1 (en) 2024-06-27

Family

ID=85959170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/112709 WO2024131091A1 (en) 2022-12-21 2023-08-11 Information association method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN115984004A (en)
WO (1) WO2024131091A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984004A (en) * 2022-12-21 2023-04-18 深圳市富途网络科技有限公司 Information association method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889024A (en) * 2019-10-25 2020-03-17 武汉灯塔之光科技有限公司 Method and device for calculating information-related stock
US20200097605A1 (en) * 2018-09-25 2020-03-26 Microsoft Technology Licensing, Llc Machine learning techniques for automatic validation of events
CN113378555A (en) * 2021-06-22 2021-09-10 富途网络科技(深圳)有限公司 Intelligent association method for individual stock and related product
CN113868431A (en) * 2021-09-16 2021-12-31 佳兆业投资咨询(深圳)有限公司 Financial knowledge graph-oriented relation extraction method and device and storage medium
CN115984004A (en) * 2022-12-21 2023-04-18 深圳市富途网络科技有限公司 Information association method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115984004A (en) 2023-04-18
