WO2020185110A1 - Method and system for searching for relevant news - Google Patents

Method and system for searching for relevant news

Info

Publication number
WO2020185110A1
Authority
WO
WIPO (PCT)
Prior art keywords
news
lemmas
company
events
event
Prior art date
Application number
PCT/RU2019/000162
Other languages
English (en)
Russian (ru)
Inventor
Федор Борисович ФЕДОРОВ
Александра Евгеньевна ЛИПАЧЕВА
Владимир Алексеевич КУЗНЕЦОВ
Роман Владиславович ЧЕРКАСОВ
Original Assignee
Публичное Акционерное Общество "Сбербанк России"
Priority date
2019-03-14
Filing date
2019-03-14
Publication date
2020-09-17
Application filed by Публичное Акционерное Общество "Сбербанк России"

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures

Definitions

  • The present technical solution relates generally to the field of information technology, and in particular to search engines designed to identify relevant information from heterogeneous data sources.
  • A common drawback of existing approaches is that they provide no method for identifying relevant news tied to a news object, for example, a company and an event associated with it, which prevents efficient collection of relevant information from multiple data sources.
  • The technical problem to be solved by the claimed approach is to provide a process for searching for and generating a set of news linked to a given set of company names, as news objects, and to events about which information appears in open data sources.
  • The technical result achieved in solving this problem is the formation of a linked set of information from news sources, grouped by the companies that are the objects of the news and by given types of events.
  • An additional technical result is improved accuracy in identifying information about companies for a given type of event in publicly available information sources.
  • The filtering is performed by calculating the Jaccard measure between news signatures.
  • The news assigned to a company is stored in a database.
  • The news aggregator updates the news list using information channels.
  • The information channels are websites on the Internet and/or messenger channels.
  • Whether a name mentioned in the news belongs to a given company is determined using a decision tree algorithm.
  • A statistical measure is calculated for each lemma of the news text during lemmatization.
  • The machine learning algorithm is a logistic regression that classifies whether the news belongs to the event based on an analysis of the statistical measure of the lemmas.
  • The machine learning algorithm is a gradient boosting model trained to classify an event based on the number of sentences containing lemmas that identify an event from a search query.
  • A relevant news search system comprises at least one processor and at least one memory containing machine-readable instructions that, when executed by the at least one processor, perform the above method.
  • FIG. 1 illustrates the interaction of the elements included in the claimed solution.
  • FIG. 2 illustrates the general flow of the method.
  • FIG. 3 illustrates the processing of text data.
  • FIG. 4 shows an example of a graphical user interface when interacting with a service for the selection of relevant news.
  • FIG. 5 illustrates a general view of a computing device.
  • FIG. 1 shows the general computing architecture (100) of the presented solution.
  • The main functionality for collecting and processing information is executed on a control server (110), which receives information through a data transmission channel from a news aggregator server (120), which in turn is connected via the Internet (150) to a plurality of news resources (130).
  • The server (110) provides interaction with users (10) to display data on the collected news information, as well as additional functionality, which is disclosed later in the application materials.
  • The Internet or an intranet can be used as the data transmission channel between the control server (110) and the news aggregator server (120).
  • The news aggregator server (120) can comprise several devices that are part of different network environments, for example, a set of servers, routers, clusters, etc.
  • The data transmission channel can be organized using various known data transmission protocols and technologies, both wired and wireless, for example, TCP/IP, 802.11, Ethernet, FTP, etc., forming various types of networks, in particular LAN, WAN, PAN, WLAN, etc.
  • The control server (110) performs the main processing of the information received from the news aggregator server (120), and stores and generates data for display to users (10).
  • The information display can be formed using a specialized graphical user interface.
  • Users (10) can interact with the control server (110) using a web portal or other type of software application that provides access to aggregated news information. Access can be provided, for example, through an API.
  • Users (10) can interact using various electronic devices, for example, a computer, laptop, smartphone, tablet, game console, smart wearable electronic device, thin client, or augmented, mixed or virtual reality devices, etc.
  • The news aggregator server (120) is connected via the Internet (150) to various information resources (130) or information channels that provide news information.
  • Such resources (130) can be, for example, websites, messenger channels (Telegram TM, WhatsApp TM, Viber TM, etc.), social networks (Facebook TM, V.e TM, etc.).
  • The received information can be saved on the server (110) in JSON format in a data store, for example, a database. The source of the news information and the date of its placement on the corresponding resource (130) can be taken into account.
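A stored record of this kind could look like the following minimal sketch. The field names (`source`, `published_at`, `text`) and the example URL are assumptions made for illustration; the source only states that JSON is used and that the news source and placement date can be taken into account.

```python
import json
from datetime import date


def make_news_record(text, source, published):
    """Serialize one news item to a JSON string (field names are hypothetical)."""
    record = {
        "source": source,                       # resource (130) the news came from
        "published_at": published.isoformat(),  # date of placement on the resource
        "text": text,
    }
    # ensure_ascii=False keeps non-Latin text (e.g. Cyrillic) readable in storage
    return json.dumps(record, ensure_ascii=False)


stored = make_news_record(
    "Company X opened a new office.",
    "https://example.com/news/1",  # placeholder URL
    date(2019, 3, 14),
)
restored = json.loads(stored)
```

Keeping the source and date as separate fields makes later filtering by time range or by channel a plain key lookup.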
  • FIG. 2 shows a general process for performing the claimed method of searching for relevant news information (200).
  • Information from news sources, collected and stored on the news aggregator server (120), is transmitted (201) to the control server (110).
  • Information from the news aggregator server (120) can be transmitted in online or offline mode.
  • In online mode, data from the Internet (150) is transmitted as soon as it appears on a web resource to which the news aggregator server (120) is connected.
  • In offline mode, news is stored on the news aggregator server (120), for example, in a database, and is transmitted to the control server (110) at a set time (for example, every hour, once a day, etc.) or upon request from it.
  • Data from the news aggregator server (120) can be transmitted in various formats, for example, XML, HTML, TXT, and the like.
  • The data format for transmission can also change depending on the mode of information transmission to the control server (110).
  • The news data contains information about the companies mentioned in the text.
  • Relevant information is searched for in the data received from the news aggregator server (120) by processing (202) the obtained data array using a machine learning model, which is trained to search the array of textual information by company names (2021) and corresponding events (2022) and to make judgments about the relevance of that information.
  • Data processing at the server (110) is performed upon receipt of a new data array from the news aggregator server (120), or according to a predetermined scenario.
  • The predetermined scenario can be configured as a script that activates the machine learning model for data processing (202) at a set time.
  • The information store of the control server (110), which contains data received from the news aggregator server (120) from news sources (130), is accessed.
  • The information stored on the control server (110) is processed (202) using a machine learning model to identify relevant data and to bind (203) data from the news to the corresponding types of events.
  • FIG. 3 shows a process (300) for processing news data received from a news aggregator server (120), which is performed during steps (202) - (203).
  • The news text data received from the news aggregator server (120) is lemmatized, during which the text corpus of each news item is divided into lemmas. The news text and metadata are extracted from the received files.
  • The body of the news is split into words at all punctuation separators, and the words are then reduced to normal form, for example, using the pymorphy2 library. The text is then cleaned of punctuation marks, stop words (prepositions, conjunctions, pronouns) and named entities. Here, a named entity is taken to be any word that begins with a capital letter and is not the first word in a sentence. N-gram extraction (https://ru.wikipedia.org/wiki/N-грамма) can also be performed, in which the most frequent word combinations of 2 to 10 lemmas are identified in the text. The list of the most frequent phrases was obtained by automatic analysis of a large text corpus and contains more than 9 million entries.
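The normalization steps described above can be sketched as follows. This is a simplified illustration, not the patented implementation: lowercasing stands in for morphological normalization (the source mentions pymorphy2 for that step), the stop-word set is a tiny illustrative subset, and the function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative subset only; a real list would cover prepositions, conjunctions, pronouns.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "it", "its", "to"}


def normalize(text):
    """Split text into words, drop capitalized non-initial words and stop words."""
    tokens = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"\w+", sentence)
        for position, word in enumerate(words):
            # Heuristic from the text: a capitalized word that is not the first
            # word of its sentence is treated as a named entity and removed.
            if position > 0 and word[0].isupper():
                continue
            lemma = word.lower()  # stand-in for reduction to normal form
            if lemma in STOP_WORDS:
                continue
            tokens.append(lemma)
    return tokens


def frequent_ngrams(tokens, min_n=2, max_n=10, top=5):
    """Count all n-grams of min_n..max_n lemmas and return the most frequent."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(top)


tokens = normalize("The bank opened a new office in Moscow. The bank opened a branch.")
```

In this sketch the n-grams are counted within a single text; the source instead describes a precomputed list of frequent phrases built from a large corpus, which would replace the local counting.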
  • Incoming news goes through a deduplication process, during which duplicate news is filtered out.
  • A MinHash signature is calculated for each news item (see https://en.wikipedia.org/wiki/MinHash), after which the similarity of the signatures of each pair of news items is calculated according to the Jaccard measure (also called the Jaccard coefficient). If the similarity of a pair exceeds a specified threshold, for example, 0.7, the shorter of the two texts is considered a duplicate and is not processed further.
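The deduplication step can be illustrated with a minimal sketch. The shingle length, the signature length of 64 and the use of salted BLAKE2b as the hash family are assumptions made here; the source specifies only MinHash signatures, the Jaccard measure and an example threshold of 0.7. The fraction of matching signature positions is used as the estimate of the Jaccard similarity.

```python
import hashlib

NUM_HASHES = 64  # signature length; an illustrative choice


def shingles(lemmas, k=3):
    """Set of k-lemma shingles for one news item (k is an assumed parameter)."""
    if len(lemmas) < k:
        return {tuple(lemmas)}
    return {tuple(lemmas[i:i + k]) for i in range(len(lemmas) - k + 1)}


def minhash_signature(items, num_hashes=NUM_HASHES):
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(repr(item).encode("utf-8"),
                                digest_size=8, salt=salt).digest(),
                "big")
            for item in items))
    return signature


def signature_similarity(sig_a, sig_b):
    """Share of equal positions, an estimate of the Jaccard similarity."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)


def deduplicate(news, threshold=0.7):
    """news: list of lemma lists. Returns the indices of the items that are kept."""
    signatures = [minhash_signature(shingles(item)) for item in news]
    dropped = set()
    for i in range(len(news)):
        for j in range(i + 1, len(news)):
            if i in dropped or j in dropped:
                continue
            if signature_similarity(signatures[i], signatures[j]) > threshold:
                # The shorter text of the pair is treated as the duplicate.
                dropped.add(i if len(news[i]) < len(news[j]) else j)
    return [i for i in range(len(news)) if i not in dropped]
```

Comparing every pair is quadratic in the number of news items; for large batches a banding scheme (locality-sensitive hashing) over the signatures is a common way to limit the number of comparisons, though the source does not mention one.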
  • At the next step (302), after lemmatization of the news texts, the normalized text is processed.
  • The news text is searched for the names of companies that have no homonyms (for example, Sberbank TM). All phrases with a capital letter and in quotes are found, after which the lemmas of the found phrases are looked up in the list of companies stored in the server database (110).
  • The classification threshold is 0.5.
  • The lemmas obtained from the news body are processed at step (303) using machine learning models.
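A rough sketch of this lookup is shown below. Several things are assumed for illustration: the company list `KNOWN_COMPANIES` and its entries are hypothetical, candidate phrases are taken as the union of quoted phrases and runs of capitalized words, and lowercasing stands in for lemmatization of the candidate phrase.

```python
import re

# Hypothetical company list keyed by normalized name; the source keeps it in a database.
KNOWN_COMPANIES = {"sberbank": "Sberbank", "acme bank": "Acme Bank"}

# Phrases enclosed in straight, curly or angle quotes.
QUOTED = re.compile(r'[«"“]([^«»"“”]+)[»"”]')
# Runs of one or more capitalized words (Latin or Cyrillic).
CAPITALIZED = re.compile(r"\b[A-ZА-ЯЁ][\w-]*(?:\s+[A-ZА-ЯЁ][\w-]*)*")


def find_companies(text):
    """Return known companies mentioned in the text, in order of first match."""
    candidates = QUOTED.findall(text) + CAPITALIZED.findall(text)
    found = []
    for phrase in candidates:
        key = phrase.strip().lower()  # stand-in for lemmatizing the phrase
        company = KNOWN_COMPANIES.get(key)
        if company and company not in found:
            found.append(company)
    return found


mentions = find_companies('Shares of Sberbank rose after "Acme Bank" reported losses.')
```

A direct dictionary lookup like this only suits names without homonyms, which is the case this step is described for.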
  • As one example, logistic regression can be applied to the TF-IDF statistical measure calculated for the text lemmas (see https://ru.wikipedia.org/wiki/TF-IDF). The measure is calculated for each lemma in the text, after which a pre-trained logistic regression makes a judgment based on the obtained features.
  • A list of given lemmas is compiled; for example, the list may contain the 30-40 lemmas that have the most weight in the logistic regression. The list is built for each event after the logistic regression has been trained.
  • The weight of each lemma is determined, and lemmas are selected for the list based on the values of their weights.
  • A set of lemmas that are most characteristic of the event, for example, 10-15 lemmas, can be specified from the previously defined list of lemmas. If the event was assigned to the news at step (304), all lemmas from this set that are found in the text are highlighted in the text.
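The TF-IDF scoring and the weight-based selection of lemmas can be sketched as follows. The weights in the example are invented for illustration (in the described method they would come from a trained logistic regression), the TF-IDF variant (relative term frequency times the natural log of inverse document frequency) is one common choice rather than one named by the source, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of lemma lists. Returns one {lemma: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter(lemma for doc in documents for lemma in set(doc))
    vectors = []
    for doc in documents:
        if not doc:
            vectors.append({})
            continue
        counts = Counter(doc)
        vectors.append({
            lemma: (count / len(doc)) * math.log(n_docs / doc_freq[lemma])
            for lemma, count in counts.items()
        })
    return vectors


def event_probability(features, weights, bias=0.0):
    """Logistic regression score: sigmoid of the weighted sum of TF-IDF features."""
    score = bias + sum(weights.get(lemma, 0.0) * value
                       for lemma, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def top_lemmas(weights, size=30):
    """Lemmas with the largest weights, e.g. the 30-40 used for an event's list."""
    ranked = sorted(weights.items(), key=lambda pair: pair[1], reverse=True)
    return [lemma for lemma, _ in ranked[:size]]


docs = [["bank", "merger", "announce"], ["bank", "profit", "grow"]]
weights = {"merger": 8.0, "announce": 2.0, "profit": -8.0}  # made-up weights
vectors = tf_idf(docs)
is_merger_news = [event_probability(v, weights) > 0.5 for v in vectors]
```

The 0.5 cut-off used here is an assumption for the event classifier; a lemma that occurs in every document, such as "bank" above, gets a TF-IDF value of zero and so does not influence the score.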
  • A second example of applying a machine learning model is a gradient boosting classification algorithm such as LightGBM (https://lightgbm.readthedocs.io). For each news text, the number of sentences containing pairs of lemmas characteristic of the event is counted. The characteristic pairs of lemmas (and their number) are selected automatically for each event during the training of the classifier.
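The feature described here, the number of sentences containing a characteristic pair of lemmas, can be computed as in the sketch below. The lemma pairs and the function name are hypothetical, and the sketch covers feature extraction only; the resulting vector would then be passed to a gradient boosting classifier such as LightGBM, which is not shown.

```python
def pair_sentence_counts(sentences, lemma_pairs):
    """Count, for each lemma pair, the sentences that contain both lemmas.

    sentences: list of lemma lists, one per sentence of a news text.
    lemma_pairs: list of (lemma_a, lemma_b) tuples characteristic of an event.
    """
    sentence_sets = [set(sentence) for sentence in sentences]
    return [
        sum(1 for lemmas in sentence_sets if first in lemmas and second in lemmas)
        for first, second in lemma_pairs
    ]


sentences = [
    ["company", "file", "bankruptcy"],
    ["court", "accept", "bankruptcy", "claim"],
    ["company", "deny", "rumor"],
]
pairs = [("company", "bankruptcy"), ("court", "claim")]  # hypothetical pairs
features = pair_sentence_counts(sentences, pairs)
```

Counting sentences rather than raw occurrences rewards lemmas that appear together in the same statement, which is a stronger signal of an event than the two words appearing far apart in the text.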
  • FIG. 4 shows an example of a graphical user interface (400) for interacting with a service for the selection of relevant news information.
  • The interface (400) provides functionality for displaying and managing the content of the provided data.
  • The search query is formed using the company name input panel (401).
  • The main field (404), which displays current or found information, presents the list of companies for which relevant information from the server database (110) is identified.
  • Companies in the field (404) can be displayed in different orders, for example, alphabetically, by the number of news items, and the like.
  • The information can be filtered by a time range, which is set in the date input field (402).
  • The interface (400) contains a control panel (403) for setting the parameters of search queries. Using the control panel (403), the user can configure the detection of certain types of events, link companies, configure service parameters, etc.
  • The field (405) displays a list of identified news sources in accordance with the specified events for the companies.
  • Users (10) can also set an alert function for selected company names. Notifications about new news items can be sent via e-mail, push notifications, SMS, etc.
  • The user (10) can configure the required parameters, for example, the company name and the types of events related to companies.
  • The generated information on processed news can also be displayed using a filter customized to the role of the user (10) interacting with the interface (400). Based on a parameter of the user's (10) account, only the news containing the types of events associated with the user's role is displayed.
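Role-based filtering of this kind could be sketched as below. The role names, the event types and the record layout are all hypothetical; the source states only that news is filtered by the event types associated with the user's role.

```python
# Hypothetical mapping from a user role to the event types shown to that role.
ROLE_EVENTS = {
    "risk_manager": {"bankruptcy", "lawsuit"},
    "account_manager": {"merger"},
}


def visible_news(news_items, role):
    """Keep only the news whose event type is associated with the given role."""
    allowed = ROLE_EVENTS.get(role, set())
    return [item for item in news_items if item["event"] in allowed]


news_items = [
    {"company": "Acme Bank", "event": "bankruptcy"},
    {"company": "Acme Bank", "event": "merger"},
]
for_risk = visible_news(news_items, "risk_manager")
```

An unknown role maps to an empty set here, so such a user sees no news rather than all of it.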
  • FIG. 5 shows an example of a general view of the device (500), which provides the implementation of the presented solution.
  • A range of computing devices can be implemented on the basis of the device (500), for example, the control server (110), the news aggregator server (120), user (10) devices, etc.
  • The device (500) contains one or more processors (501) connected by a common data exchange bus, memory means such as RAM (502) and ROM (503), input/output interfaces (504), input/output means (505), and a networking means (506).
  • The processor (501) (or multiple processors, a multi-core processor, etc.) can be selected from a range of widely used devices, for example, from manufacturers such as Intel TM, AMD TM, Apple TM, Samsung Exynos TM, MediaTEK TM, Qualcomm Snapdragon TM, etc.
  • The processor (501) can also be a graphics processor, for example, an NVIDIA GPU or Graphcore, a type that is likewise suitable for full or partial execution of the method (200) and can also be used for training and applying machine learning models in various information systems.
  • RAM (502) is random access memory intended for storing machine-readable instructions executed by the processor (501) to perform the necessary logical data processing operations.
  • RAM (502) typically contains executable instructions of an operating system and associated software components (applications, software modules, etc.). The available memory of a graphics card or graphics processor can also act as RAM (502).
  • ROM (503) is one or more means for permanent data storage, for example, a hard disk drive (HDD), solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.
  • I/O interfaces (504) are used to organize the operation of the components of the device (500) and of externally connected devices.
  • The choice of interfaces depends on the specific version of the computing device; they can be, but are not limited to: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, Type-C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.
  • Various I/O means (505) are used, for example, a keyboard, display (monitor), touch display, touchpad, joystick, mouse, light pen, stylus, trackball, speakers, microphone, augmented reality devices, optical sensors, tablet, light indicators, projector, camera, biometric identification means (retina scanner, fingerprint scanner, voice recognition module), etc.
  • The networking means (506) provides data transmission via an internal or external computer network, for example, an intranet, the Internet, a LAN and the like.
  • One or more means (506) may be used, including but not limited to: an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.
  • Satellite navigation means, for example, GPS, GLONASS, BeiDou or Galileo, can also be used as part of the device (500).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to the field of information technology and in particular to search engines for discovering relevant information from data sources of various types. The technical result is the generation of a linked set of information from news sources, grouped by the companies at the origin of the news and by given types of events. In a first preferred embodiment, the invention is a computer-implemented method of searching for relevant news, which comprises: obtaining, on a control server, a set of news from at least one news aggregator server; analysing the obtained set of news on the control server, including lemmatizing the text of each news item from said news server; processing the lemmas obtained from the news texts using a machine learning model that comprises a predefined set of company data and a list of events, a given set of lemmas being established for each event in the machine learning model; determining the news containing lemmas that identify given events and linking the discovered events to at least one company; and then generating a list of relevant news on the basis of the analysis performed.
PCT/RU2019/000162 2019-03-14 2019-03-14 Method and system for searching for relevant news WO2020185110A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2019107328 2019-03-14
RU2019107328A RU2698916C1 (ru) 2019-03-14 2019-03-14 Method and system for searching for relevant news

Publications (1)

Publication Number Publication Date
WO2020185110A1 (fr) 2020-09-17

Family

ID=67851403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2019/000162 WO2020185110A1 (fr) 2019-03-14 2019-03-14 Method and system for searching for relevant news

Country Status (3)

Country Link
EA (1) EA038241B1 (fr)
RU (1) RU2698916C1 (fr)
WO (1) WO2020185110A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090055359A1 (en) * 2007-08-14 2009-02-26 John Nicholas Gross News Aggregator and Search Engine Using Temporal Decoding
US20120158711A1 (en) * 2003-09-16 2012-06-21 Google Inc. Systems and methods for improving the ranking of news articles
US20130097279A1 (en) * 2006-06-27 2013-04-18 Jared Polis Aggregator with managed content
US20160371344A1 (en) * 2014-03-11 2016-12-22 Baidu Online Network Technology (Beijing) Co., Ltd Search method, system and apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7762453B2 (en) * 1999-05-25 2010-07-27 Silverbrook Research Pty Ltd Method of providing information via a printed substrate with every interaction
US7293019B2 (en) * 2004-03-02 2007-11-06 Microsoft Corporation Principles and methods for personalizing newsfeeds via an analysis of information novelty and dynamics
US9384211B1 (en) * 2011-04-11 2016-07-05 Groupon, Inc. System, method, and computer program product for automated discovery, curation and editing of online local content
RU2629449C2 (ru) * 2014-05-07 2017-08-29 Общество С Ограниченной Ответственностью "Яндекс" Device and method for selecting and placing targeted messages on a search results page
RU2608884C2 (ru) * 2014-06-30 2017-01-25 Общество С Ограниченной Ответственностью "Яндекс" Computer-implemented method of providing a graphical user interface on the display screen of an electronic device by a browser contextual assistant (variants), and server and electronic device used therein

Also Published As

Publication number Publication date
EA038241B1 (ru) 2021-07-29
RU2698916C1 (ru) 2019-09-02
EA201990538A1 (ru) 2020-09-30

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19919152; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 19919152; Country of ref document: EP; Kind code of ref document: A1