WO2020185110A1 - Method and system for searching for relevant news
- Publication number
- WO2020185110A1 (PCT/RU2019/000162)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- news
- lemmas
- company
- events
- event
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
Definitions
- The present technical solution relates generally to the field of information technology, and in particular to search engines designed to identify relevant information from heterogeneous data sources.
- A common drawback of existing approaches is that they lack a method for identifying relevant news tied to a news object, for example a company and an event associated with it, which prevents efficient collection of relevant information from multiple data sources.
- The technical problem addressed by the claimed approach is to provide a process for searching for and generating a set of news tied to a given set of company names, as news objects, and to events about which information appears in open data sources.
- The technical result achieved in solving this problem is the formation of a linked set of information from news sources, grouped by the companies that are the subject of the news and by given types of events.
- An additional technical result is improved accuracy in identifying information about companies for a given type of event in publicly available information sources.
- Duplicate news is filtered by calculating the Jaccard measure between news signatures.
- The news assigned to a company is stored in a database.
- The news aggregator updates the news list using information channels.
- The information channels are websites on the Internet and/or messenger channels.
- Whether a mention in the news refers to a given company is determined using a decision tree algorithm.
- A statistical measure is calculated for each lemma of the news text during lemmatization.
- The machine learning algorithm is logistic regression, which classifies whether the news belongs to the event based on the statistical measure of the lemmas.
- The machine learning algorithm is gradient boosting trained to classify an event based on the number of sentences containing lemmas that identify an event from a search query.
- A relevant news search system comprises at least one processor and at least one memory containing machine-readable instructions that, when executed by the at least one processor, perform the above method.
- FIG. 1 illustrates the interaction of the elements included in the claimed solution.
- FIG. 2 illustrates the general flow of the method.
- FIG. 3 illustrates the processing of text data.
- FIG. 4 shows an example of a graphical user interface when interacting with a service for the selection of relevant news.
- FIG. 5 illustrates a general view of a computing device.
- FIG. 1 shows the general computing architecture (100) of the presented solution.
- The main functionality for collecting and processing information is executed on a control server (110), which receives information over a data transmission channel from a news aggregator server (120), which is in turn connected via the Internet (150) to a plurality of news resources (130).
- The server (110) interacts with users (10) to display the collected news information and provides additional functionality, which is disclosed later in the application materials.
- The Internet or an intranet can be used as the data transmission channel between the control server (110) and the news aggregator server (120).
- The news aggregator server (120) can comprise several devices that are part of different network environments, for example a set of servers, routers, clusters, etc.
- The data transmission channel can be organized using various known data transmission protocols and technologies, both wired and wireless, for example TCP/IP, 802.11, Ethernet, FTP, etc., forming various kinds of networks, in particular LAN, WAN, PAN, WLAN, etc.
- The control server (110) performs the main processing of information received from the news aggregator server (120), and stores and generates data for display to users (10).
- The information can be displayed using a specialized graphical user interface.
- Users (10) can interact with the control server (110) using a web portal or another type of software application that provides access to aggregated news information. Access can be provided, for example, through an API.
- Users (10) can interact using various electronic devices, for example a computer, laptop, smartphone, tablet, game console, smart wearable device, thin client, or an augmented, mixed, or virtual reality device, etc.
- The news aggregator server (120) is connected via the Internet (150) to various information resources (130), or information channels, that provide news information.
- Such resources (130) can be, for example, websites, messenger channels (Telegram TM, WhatsApp TM, Viber TM, etc.), and social networks (Facebook TM, V.e TM, etc.).
- The received information can be saved on the server (110) in JSON format in a data store, for example a database. The source of the news and the date it was posted on the corresponding resource (130) can be taken into account.
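- As an illustration of the JSON storage described above, the following is a minimal sketch of a news record that keeps the source and the posting date. The field names `text`, `source`, and `published` and both helper functions are assumptions made for this example; the application does not specify a schema.

```python
import json
from datetime import date


def make_news_record(text: str, source: str, published: date) -> str:
    """Serialize one news item to a JSON string (field names are illustrative)."""
    record = {
        "text": text,
        "source": source,                     # resource (130) the item came from
        "published": published.isoformat(),   # date of placement on that resource
    }
    return json.dumps(record, ensure_ascii=False)


def load_news_record(raw: str) -> dict:
    """Parse a stored record and restore the date field."""
    record = json.loads(raw)
    record["published"] = date.fromisoformat(record["published"])
    return record
```

- Storing the date in ISO 8601 form keeps the record plain JSON while still allowing later filtering by time range.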
- FIG. 2 shows a general process for performing the claimed method of searching for relevant news information (200).
- Information from news sources, collected and stored on the news aggregator server (120), is transmitted (201) to the control server (110).
- Information from the news aggregator server (120) can be transmitted in online or offline mode.
- In online mode, data from the Internet (150) is transmitted as soon as it appears on a web resource to which the news aggregator server (120) is connected.
- In offline mode, news is stored on the news aggregator server (120), for example in a database, and is transmitted to the control server (110) at a set time (for example, every hour or once a day) or upon request from it.
- Data from the news aggregator server (120) can be transmitted in various formats, for example XML, HTML, TXT, and the like.
- The data format can also change depending on the mode in which information is transmitted to the control server (110).
- The news data contains information about the companies mentioned in the text.
- The search for relevant information in the data received from the news aggregator server (120) is carried out by processing (202) the obtained data array using a machine learning model, which is trained to search the array of textual information for company names (2021) and corresponding events (2022) and to judge the relevance of that information.
- Data processing at the server (110) is performed upon receipt of a new data array from the news aggregator server (120), or according to a predetermined scenario.
- The predetermined scenario can be configured as a script that activates the machine learning model for data processing (202) at a set time.
- The information store of the control server (110) is accessed; it contains data received by the news aggregator server (120) from news sources (130).
- The information stored on the control server (110) is processed (202) using a machine learning model to identify relevant data and to bind (203) data from the news to the corresponding types of events.
- FIG. 3 shows a process (300) for processing news data received from a news aggregator server (120), which is performed during steps (202) - (203).
- The news text data received from the news aggregator server (120) is lemmatized, during which the text corpus of each news item is divided into lemmas. The news text and metadata are extracted from the received files.
- The body of the news item is split into words at all punctuation separators, and the words are then reduced to normal form, for example using the pymorphy2 library. The text is then cleaned of punctuation marks, stop words (prepositions, conjunctions, pronouns), and named entities. Here, a named entity is considered to be any word that begins with a capital letter and is not the first word in a sentence. N-gram extraction (https://ru.wikipedia.org/wiki/N-грамма) can also be performed, in which the most frequent word combinations of 2 to 10 lemmas are identified in the text. The list of the most frequent phrases was obtained by automatic analysis of a large text corpus and contains more than 9 million entries.
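- The preprocessing steps above can be sketched as follows. This is an illustrative sketch only: `normalize` is a placeholder that merely lowercases a word (a real system would call a morphological analyzer such as pymorphy2 here), the stop-word set is a small invented sample, and all function names are assumptions for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be language-specific.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "it", "to", "on"}


def normalize(word: str) -> str:
    """Placeholder for reduction to normal form (lemmatization)."""
    return word.lower()


def split_sentences(text: str) -> list:
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def lemmatize(text: str) -> list:
    """Return lemmas with punctuation, stop words and named entities removed."""
    lemmas = []
    for sentence in split_sentences(text):
        words = re.findall(r"\w+", sentence)  # drops punctuation
        for position, word in enumerate(words):
            # Rule from the text: a capitalized word that is not the first
            # word of its sentence is treated as a named entity and removed.
            if position > 0 and word[0].isupper():
                continue
            lemma = normalize(word)
            if lemma not in STOP_WORDS:
                lemmas.append(lemma)
    return lemmas


def ngram_counts(lemmas: list, n_min: int = 2, n_max: int = 10) -> Counter:
    """Count all word combinations of n_min..n_max consecutive lemmas."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(lemmas) - n + 1):
            counts[tuple(lemmas[start:start + n])] += 1
    return counts
```

- In this sketch the n-grams are counted within a single text; the application instead refers to a precomputed list of frequent phrases built from a large corpus, which would replace the local counting step.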
- Incoming news goes through a deduplication process, during which duplicate news items are filtered out.
- A MinHash signature is calculated for each news item (see https://en.wikipedia.org/wiki/MinHash), after which the similarity of signatures is calculated for each pair of news items according to the Jaccard measure (also called the Jaccard coefficient). If the similarity of a pair exceeds a specified threshold, for example 0.7, the shorter news item of the pair is considered a duplicate and is not processed further.
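- The deduplication step can be sketched as follows. This is a minimal sketch under stated assumptions: the signature length of 64, the use of seeded MD5 as the hash family, and all function names are choices made for this example and are not taken from the application. The fraction of matching signature positions serves as an estimate of the Jaccard measure between the lemma sets.

```python
import hashlib

NUM_HASHES = 64  # signature length; an assumed value


def _hash(seed: int, token: str) -> int:
    """One member of a seeded hash family, built from MD5 for determinism."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(lemmas: list, num_hashes: int = NUM_HASHES) -> list:
    """MinHash signature of the set of lemmas of one news item."""
    tokens = set(lemmas)
    if not tokens:
        raise ValueError("cannot build a signature for an empty text")
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def signature_similarity(sig_a: list, sig_b: list) -> float:
    """Share of equal positions: an estimate of the Jaccard measure."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(news: list, threshold: float = 0.7) -> list:
    """Return indices of news items kept after pairwise comparison.

    `news` is a list of lemma lists. When a pair is more similar than the
    threshold, the shorter item is dropped (the later one on a tie).
    """
    signatures = [minhash_signature(item) for item in news]
    dropped = set()
    for i in range(len(news)):
        for j in range(i + 1, len(news)):
            if i in dropped or j in dropped:
                continue
            if signature_similarity(signatures[i], signatures[j]) > threshold:
                dropped.add(i if len(news[i]) < len(news[j]) else j)
    return [k for k in range(len(news)) if k not in dropped]
```

- The pairwise loop is quadratic in the number of news items; for large batches, a banding scheme such as locality-sensitive hashing over the signatures would be one way to limit the number of comparisons.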
- At the next step (302), after lemmatization of the news texts, the normalized text is processed.
- The text of the news is searched for the names of companies that do not have homonyms (for example, Sberbank TM). All capitalized phrases and phrases in quotes are found, after which the lemmas of the found phrases are looked up in the list of companies stored in the server database (110).
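- A minimal sketch of this lookup is shown below. It assumes that both capitalized phrases and quoted phrases are candidates, uses lowercasing as a stand-in for lemmatization, and represents the company list as a plain set; the regular expressions and function names are illustrative assumptions.

```python
import re

# Phrases in straight or angle quotes.
QUOTED = re.compile(r'"([^"]+)"|«([^»]+)»')
# Runs of one or more capitalized words (Latin or Cyrillic capitals).
CAPITALIZED = re.compile(r"\b[A-ZА-ЯЁ][\w-]*(?:\s+[A-ZА-ЯЁ][\w-]*)*")


def candidate_phrases(text: str) -> list:
    """Collect quoted phrases and capitalized phrases from the text."""
    phrases = [m.group(1) or m.group(2) for m in QUOTED.finditer(text)]
    phrases.extend(m.group(0) for m in CAPITALIZED.finditer(text))
    return phrases


def find_companies(text: str, known_companies: set) -> list:
    """Return known company names found in the text, without repeats."""
    found = []
    for phrase in candidate_phrases(text):
        key = phrase.lower()  # stand-in for lemmatizing the phrase
        if key in known_companies and key not in found:
            found.append(key)
    return found
```

- Sentence-initial words such as "Shares" also match the capitalized pattern, so the lookup in the company list is what filters out false candidates.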
- The classification threshold is 0.5.
- The lemmas obtained from the news body are processed at step (303) using machine learning models.
- Logistic regression can be applied by calculating the TF-IDF statistical measure for text lemmas (see https://ru.wikipedia.org/wiki/TF-IDF). The statistical measure is calculated for each lemma in the text, after which a pre-trained logistic regression makes a judgment on the obtained features.
- A list of given lemmas is compiled; for example, the list may contain the 30-40 lemmas that have the most weight in the logistic regression. The list is built for each event after the logistic regression has been trained.
- The weight of each lemma is determined, and lemmas are selected for the list based on their weights.
- A set of lemmas, for example the 10-15 lemmas most characteristic of the event, can be selected from the previously defined list; if the event was assigned to a news item at step (304), all lemmas from this set that are found in the text are highlighted in the text.
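- The scoring and lemma selection described above can be sketched as follows. The sketch assumes that the IDF values, lemma weights, and bias come from an already trained logistic regression for one event; the numbers used with it would be invented for illustration, and every function name here is hypothetical.

```python
import math
from collections import Counter


def tf_idf(lemmas: list, idf: dict) -> dict:
    """TF-IDF per lemma: term frequency in this text times a precomputed IDF."""
    if not lemmas:
        return {}
    counts = Counter(lemmas)
    total = len(lemmas)
    return {lemma: (n / total) * idf.get(lemma, 0.0) for lemma, n in counts.items()}


def event_probability(lemmas: list, idf: dict, weights: dict, bias: float) -> float:
    """Apply pre-trained logistic regression weights to the TF-IDF features."""
    features = tf_idf(lemmas, idf)
    score = bias + sum(weights.get(lemma, 0.0) * value for lemma, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def top_lemmas(weights: dict, n: int) -> list:
    """The n lemmas with the largest weights, e.g. n = 30-40 per event."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [lemma for lemma, _ in ranked[:n]]


def lemmas_to_highlight(lemmas: list, event_lemmas: set) -> list:
    """Lemmas of the text that belong to the event's characteristic set."""
    return [lemma for lemma in lemmas if lemma in event_lemmas]
```

- A news item would be assigned to the event when `event_probability` exceeds a chosen threshold; using 0.5 for that purpose is an assumption of this sketch.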
- A second example of applying a machine learning model is a gradient boosting classification algorithm such as LightGBM (https://lightgbm.readthedocs.io). For each news text, the number of sentences containing pairs of lemmas characteristic of the event is counted. The pairs of characteristic lemmas, and their number, are selected automatically for each event during the training of the classifier.
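- The feature described above can be sketched as follows. The sketch covers only the counting step and assumes one count per lemma pair, which is one possible reading of the text; the function name is hypothetical, and the resulting vector would then be passed to a gradient boosting classifier, which is not shown.

```python
def count_sentences_with_pairs(sentences: list, pairs: list) -> list:
    """For each lemma pair, count the sentences that contain both lemmas.

    `sentences` is a list of lemma lists (one per sentence of a news text);
    `pairs` is a list of (lemma, lemma) tuples characteristic of an event.
    """
    sentence_sets = [set(sentence) for sentence in sentences]
    return [
        sum(1 for lemmas in sentence_sets if first in lemmas and second in lemmas)
        for first, second in pairs
    ]
```

- For example, a text with one sentence containing both "announce" and "merger" yields a count of 1 for the pair ("announce", "merger") and 0 for pairs that never co-occur in a sentence.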
- FIG. 4 shows an example of a graphical user interface (400) for interacting with a service for the selection of relevant news information.
- The interface (400) provides functionality for displaying and managing the content of the provided data.
- The search query is formed using the input panel for company names (401).
- The main field (404), which displays current or found information, presents a list of companies for which relevant information from the server database (110) is identified.
- Companies in field (404) can be displayed in different orders, for example alphabetically, by the number of news items, and the like.
- The information can be filtered by a time range, which is set in the date input field (402).
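- The ordering and date filtering offered by the interface can be sketched as follows. The record fields `company` and `published` and both function names are assumptions for this example; the application does not describe how the interface is implemented.

```python
from collections import Counter
from datetime import date


def filter_by_date(news: list, start: date, end: date) -> list:
    """Keep news items published within the inclusive range [start, end]."""
    return [item for item in news if start <= item["published"] <= end]


def order_companies(news: list, by: str = "name") -> list:
    """List companies alphabetically, or by descending number of news items."""
    counts = Counter(item["company"] for item in news)
    if by == "count":
        # Ties in the count are broken alphabetically.
        return sorted(counts, key=lambda company: (-counts[company], company))
    return sorted(counts)
```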
- The interface (400) contains a control panel (403) for setting search query parameters. Using the control panel (403), the user can configure the detection of certain types of events, link companies, configure service parameters, etc.
- Field (405) displays a list of identified news sources in accordance with the events specified for the companies.
- Users (10) can also set an alert function for selected company names. Notifications about new news items can be sent via e-mail, push notifications, SMS, etc.
- The user (10) can configure the required parameters, for example the company name and the types of events related to companies.
- The generated information on processed news can also be displayed through a filter customized according to the role of the user (10) interacting with the interface (400). Based on a parameter of the user's (10) account, only news containing the types of events associated with that role is displayed.
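- The role-based filter can be sketched as follows. The role names, event types, and the mapping between them are invented for this example, as is the `events` field of a news record; the application only states that a role determines which event types are shown.

```python
# Hypothetical mapping from a user role to the event types it may see.
ROLE_EVENTS = {
    "risk_analyst": {"bankruptcy", "lawsuit"},
    "investor_relations": {"merger", "dividends"},
}


def news_for_role(news: list, role: str, role_events: dict = ROLE_EVENTS) -> list:
    """Keep only news items tagged with at least one event type of the role."""
    allowed = role_events.get(role, set())
    return [item for item in news if allowed & set(item["events"])]
```

- An unknown role maps to an empty set of event types in this sketch, so such a user sees no news rather than all of it.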
- FIG. 5 shows an example of a general view of the device (500), which provides the implementation of the presented solution.
- Various computing devices can be implemented on the basis of the device (500), for example a control server (110), a news aggregator server (120), user devices (10), etc.
- The device (500) contains one or more processors (501) connected by a common data exchange bus, memory such as RAM (502) and ROM (503), input/output interfaces (504), input/output devices (505), and a networking device (506).
- The processor (501) (or multiple processors, a multi-core processor, etc.) can be selected from a range of widely used devices, for example from manufacturers such as Intel TM, AMD TM, Apple TM, Samsung Exynos TM, MediaTEK TM, Qualcomm Snapdragon TM, etc.
- A graphics processor, for example an NVIDIA GPU or Graphcore, is also suitable for full or partial execution of the method (200) and can be used for training and applying machine learning models in various information systems.
- RAM (502) is random access memory intended for storing machine-readable instructions executed by the processor (501) to perform the necessary logical data processing operations.
- RAM (502) typically contains executable instructions of the operating system and associated software components (applications, software modules, etc.). The available memory of a graphics card or graphics processor can also act as RAM (502).
- ROM (503) is one or more means of permanent data storage, for example a hard disk drive (HDD), solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.
- I/O interfaces (504) are used to organize the operation of the components of the device (500) and of externally connected devices.
- The choice of interfaces depends on the specific version of the computing device and can include, but is not limited to: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, Type-C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.
- Input/output devices (505) can include, for example, a keyboard, display (monitor), touch display, touchpad, joystick, mouse, light pen, stylus, trackball, speakers, microphone, augmented reality devices, optical sensors, tablet, light indicators, projector, camera, biometric identification devices (retina scanner, fingerprint scanner, voice recognition module), etc.
- The networking device (506) provides data transmission via an internal or external computer network, for example an intranet, the Internet, a LAN, and the like.
- One or more such devices (506) can be used, including but not limited to: Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.
- Satellite navigation systems, for example GPS, GLONASS, BeiDou, or Galileo, can also be used as part of the device (500).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to the field of information technology and concerns in particular search mechanisms for discovering relevant information from data sources of various types. The technical result is the generation of a linked set of information from news sources, grouped by the companies that are the subject of the news and by given types of events. In a first preferred embodiment, the invention concerns a computer-implemented method for searching for relevant news, which comprises: obtaining on a control server a set of news from at least one news aggregator server; analyzing the obtained set of news on the control server, including lemmatizing the text of each news item from said news server; processing the lemmas obtained from the news texts using a machine learning model that includes a predefined set of company data and a list of events, a given set of lemmas being established for each event in the machine learning model; determining the news items that contain lemmas identifying given events and linking the discovered events with at least one company; and then generating a list of relevant news on the basis of the analysis performed.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2019107328A RU2698916C1 (ru) | 2019-03-14 | 2019-03-14 | Method and system for searching for relevant news |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020185110A1 (fr) | 2020-09-17 |
Family
ID=67851403
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/RU2019/000162 WO2020185110A1 (fr) | Method and system for searching for relevant news | 2019-03-14 | 2019-03-14 |
Country Status (3)
Country | Link |
---|---|
EA (1) | EA038241B1 (fr) |
RU (1) | RU2698916C1 (fr) |
WO (1) | WO2020185110A1 (fr) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090055359A1 (en) * | 2007-08-14 | 2009-02-26 | John Nicholas Gross | News Aggregator and Search Engine Using Temporal Decoding |
US20120158711A1 (en) * | 2003-09-16 | 2012-06-21 | Google Inc. | Systems and methods for improving the ranking of news articles |
US20130097279A1 (en) * | 2006-06-27 | 2013-04-18 | Jared Polis | Aggregator with managed content |
US20160371344A1 (en) * | 2014-03-11 | 2016-12-22 | Baidu Online Network Technology (Beijing) Co., Ltd | Search method, system and apparatus |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7762453B2 (en) * | 1999-05-25 | 2010-07-27 | Silverbrook Research Pty Ltd | Method of providing information via a printed substrate with every interaction |
US7293019B2 (en) * | 2004-03-02 | 2007-11-06 | Microsoft Corporation | Principles and methods for personalizing newsfeeds via an analysis of information novelty and dynamics |
US9384211B1 (en) * | 2011-04-11 | 2016-07-05 | Groupon, Inc. | System, method, and computer program product for automated discovery, curation and editing of online local content |
RU2629449C2 (ru) * | 2014-05-07 | 2017-08-29 | Общество С Ограниченной Ответственностью "Яндекс" | Device and method for selecting and placing targeted messages on a search results page |
RU2608884C2 (ru) * | 2014-06-30 | 2017-01-25 | Общество С Ограниченной Ответственностью "Яндекс" | Computer-implemented method for providing a graphical user interface on a display screen of an electronic device by a browser contextual assistant (variants), and server and electronic device used therein |
2019
- 2019-03-14 RU RU2019107328A patent/RU2698916C1/ru active
- 2019-03-14 WO PCT/RU2019/000162 patent/WO2020185110A1/fr active Application Filing
- 2019-03-19 EA EA201990538A patent/EA038241B1/ru unknown
Also Published As
Publication number | Publication date |
---|---|
EA038241B1 (ru) | 2021-07-29 |
RU2698916C1 (ru) | 2019-09-02 |
EA201990538A1 (ru) | 2020-09-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19919152 Country of ref document: EP Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19919152 Country of ref document: EP Kind code of ref document: A1 |