WO2022025750A1

WO2022025750A1 - Person profile finder using semantic similarity measurement of object based on internet source and related keywords

Info

Publication number: WO2022025750A1
Application number: PCT/MY2020/050167
Authority: WO
Inventors: Amru Yusrin AMRUDDIN; May Fern KOH; Nardiatul Kasmi MOHAMED KASSIM; Muhammad Awis Jamaluddin JOHARI; Wooi Kin Goon
Original assignee: Mimos Berhad
Priority date: 2020-07-30
Filing date: 2020-11-24
Publication date: 2022-02-03

Abstract

The present invention provides a method for profiling an object based on search input. The method comprises receiving (305) the search input of the object to be profiled, the input including keywords; harvesting (310) data from the internet through a data harvesting bot (110); rotating keywords and pages through a spiral keyword processor (120) for hopping between targeted pages/sites to avoid anti-bot mechanisms on the targeted pages/sites; identifying (330) data relevancy based on semantic similarity of the keywords to get the most relevant data from the harvested data; identifying (340) the keywords through a named entity recognition, NER, processor (140) to extract the most relevant data; and outputting (345) the profile of the object in a structured manner with highly relevant data. An object profiling system is also provided.

Description

PERSON PROFILE FINDER USING SEMANTIC SIMILARITY MEASUREMENT OF OBJECT BASED ON INTERNET SOURCE AND RELATED KEYWORDS

Field of the Invention

[0001] The present invention relates to personal profile forming. In particular, the present invention relates to a system and method for compiling personal profile information using semantic similarity measurement of object based on internet source and related keywords.

Background

[0002] In 2019, an estimated 2.95 billion people were using social media around the world (published by J. Clement, Apr 1, 2020, on https://www.statista.com). Given any random name of a person, it is no surprise that a search returns a very large number of hits, unless the name is unique.

[0003] Platform-dependent search engines only search the information about a person that was provided within the platform, either by the person himself or by another person, i.e. a single source. When the search (of a person) is conducted on the open web, the search engine returns all the hits for the input search terms as individual links. The user is required to filter the results manually.

[0004] US Patent Publication no. US2014/0115053A1 discloses a method of finding members of a social media network based on attributes such as common interests, location, profession, age, rank, player skill, etc. The users are only exposed to each other when certain conditions are met. This system is a closed network, whereby the person search is based only on the members of the network. The information regarding the person is only as good as what was provided by the members of the network.

[0005] Canadian Patent Publication no. CA2437456A1 offers a matchmaking system that searches for new matches. Similarly, the system provides searches over closed networks. The search results are presented in the way they were provided by the owner of the profile.

[0006] European Patent No. EP159631B1 discloses a search engine that searches the Internet (open web). After a user submits a search request (also referred to as a "query") that includes search terms, the search engine service identifies web pages that may be related to those search terms. The search engine service then displays to the user the links to those web pages in an order that is based on their relevance. The user must browse through the search results and filter the information manually.

[0007] There exists a need for a system and method to facilitate the search of a person over the open web that collects and compiles the information relating to the person of interest.

Summary

[0008] In one aspect of the present invention, there is provided a method for profiling an object based on search input. The method comprises receiving the search input of the object to be profiled, the input including keywords; harvesting data from the internet through a data harvesting bot; rotating keywords and pages through a spiral keyword and page rotation engine for hopping between targeted pages/sites to avoid anti-bot mechanisms on the targeted pages/sites; establishing data relevancy based on semantic similarity of the keywords to get the most relevant data from the data harvested; classifying the keywords through a named entity recognition, NER, module to extract the most relevant data; and outputting the profile of the object in a structured manner with highly relevant data.

[0009] In one embodiment, the step of establishing data relevancy further comprises determining semantic information of an object pair in a matrix from the internet; calculating overlap semantic information; and outputting a semantic similarity value on the matrix.

[0010] In another embodiment, the method further comprises tokenizing keywords; comparing each token against language service; and tagging each keyword with its language.

[0011] In yet another embodiment, the method further comprises querying the Wikipedia based on the detected language for each tokenized keyword; querying a machine readable dictionary, MRD, and thesaurus based on the tokenized keyword; querying Wordnet based on the tokenized keyword; and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from the MRD, thesaurus and Wordnet on a keyword repository.

[0012] In another aspect of the present invention, there is provided an object profiling system for profiling an object based on search input. The system comprises a GUI for receiving input of the object to be profiled, the input including keywords; a data harvesting bot for harvesting data from the internet; a spiral keyword and page rotation engine, operationally hopping between targeted pages/sites for avoiding anti-bot mechanisms on the targeted pages/sites; a semantic similarity module for establishing data relevancy to get the most relevant data from the data harvested; a named-entity recognition, NER, module for classifying the keywords to extract the most relevant data; and an output of the profile of the object in a structured manner with the most relevant data.

[0013] In one embodiment, the semantic similarity module is adapted to operationally determine semantic information of an object pair in a matrix from the internet, calculate overlap semantic information and output a semantic similarity value on the matrix.

[0014] In another embodiment, the semantic similarity module is adapted for tokenizing keywords; comparing each token against a language service; and tagging each keyword with its detected language.

[0015] In a further embodiment, the semantic similarity module is adapted for querying the Wikipedia based on the detected language for each tokenized keyword, querying a machine readable dictionary, MRD, thesaurus and Wordnet based on the tokenized keyword, and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from the MRD, thesaurus and Wordnet on a keyword repository.

Brief Description of the Drawings

[0016] This invention will be described by way of non-limiting embodiments of the present invention, with reference to the accompanying drawings, in which:

[0017] FIG. 1 illustrates a block diagram of a person profile finder engine 100 in accordance with an embodiment of the present invention;

[0018] FIG. 2 exemplifies a screenshot for providing inputs in accordance with an embodiment of the present invention;

[0019] FIG. 3 illustrates a person profile search process in accordance with an embodiment of the present invention;

[0020] FIG. 4 illustrates a schematic diagram of the web crawling in accordance with an embodiment of the present invention;

[0021] FIG. 5 illustrates diagrammatic representation of a spiral keywords and pages rotation engine in accordance with an embodiment of the present invention;

[0022] FIG. 6 exemplifies a comparison of data captured and processed with a conventional method and spiral keyword and page rotation engine;

[0023] FIG. 7 illustrates a matrix of objects for semantic similarity comparison in accordance with an embodiment of the present invention;

[0024] FIG. 8 shows a process for obtaining semantic information for an object in accordance with an embodiment of the present invention;

[0025] FIG. 9 illustrates a scoring process of semantic similarity value in accordance with an embodiment of the present invention;

[0026] FIG. 10 illustrates schematic diagram of the keyword semantic similarity generation in accordance with one embodiment of the present invention;

[0027] FIG. 11 illustrates a process for detecting the language of a keyword in accordance with an embodiment of the present invention;

[0028] FIG. 12 illustrates a process of generating related keywords in accordance with an embodiment of the present invention;

[0029] FIG. 13 illustrates a calculation of overlapped semantic information in accordance with an embodiment of the present invention; and

[0030] FIG. 14 exemplifies search results obtained by the present system and method in accordance with the above embodiments of the present invention.

Detailed Description

[0031] In line with the above summary, the following description of a number of specific and alternative embodiments is provided to understand the inventive features of the present invention. It shall be apparent to one skilled in the art, however, that this invention may be practiced without such specific details. Some of the details may not be described at length so as not to obscure the invention. For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures.

[0032] Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps may be performed by a combination of hardware, software, firmware, and/or by human operators.

[0033] Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable storage medium tangibly embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, data storage drives be it magnetic or optical disks, and/or semiconductor-based memories, such as RAMs, ROMs, flash memory, etc., for storing digital instructions executable by any processing devices.

[0034] Various methods described herein may be practiced by combining one or more machine-readable storage media containing the code according to the present invention with appropriate standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present invention may involve one or more computers (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps of the invention could be accomplished by modules, routines, subroutines, or subparts of a computer program product.

[0035] The present invention is implementable on both local and remote computing devices. A local computer includes a personal computer or a handheld mobile device. A remote computer may be a remote server or cloud computing device, which includes a virtual machine, which generally refers to a self-contained operating environment that behaves as if it is a separate computer even though it is part of a separate computer or may be virtualized using resources from multiple computers.

[0036] In one embodiment, software applications can be developed based on available platforms or platform products to build new functionality as an extension to the existing functionality. For example, a platform can be provided by a third party, e.g. a cloud computing platform that allows provisioning of the software application in a cloud environment.

[0037] Cloud computing is an internet-based computing that provides shared processing of resources and data to computers and other devices based on demand. The cloud computing provides access to the resources like networks, servers, storage, applications and services. These resources can be rapidly provisioned and released with minimal management effort in the cloud computing.

[0038] The embodiments of the present invention enable a user to search for a person's profile through a single, easy-to-use user interface, which returns a list of profile information processed from internet sources. The results are derived from measurements of the hits found in internet sources and from a semantic similarity value that changes dynamically over time. The semantic similarity is determined by the meaning of the keywords extracted from the search results and of their related keywords; that is, the difference between two objects is calculated from the likeness of their meaning, as opposed to their syntactical or visual representation (e.g. the string format, the shape of the object, etc.). However, there are situations where the meanings of two objects differ and yet the objects are similar in many aspects. For example, the similarities of “Anwar Ibrahim” and “Dr. Mahathir” include that they were in the same coalition party, both were great leaders in Malaysia, both were politicians and ex-UMNO members, etc.; but they differ in that Anwar had never been a Prime Minister and had been jailed, while Dr. Mahathir had been the Prime Minister twice and had never been jailed. With a named entity recognition (NER) process, the semantic similarity of the two named objects can be identified and the results refined to give a highly accurate result to the user.

[0039] It is therefore an objective to provide a system and method for person profile forming based on information found over the open web. The system includes a data harvesting crawler for gathering data from the internet, a spiral keywords and pages rotation processor adapted for bypassing captcha or any anti-bot means during data harvesting, a semantic similarity module for identifying data relevancy to get the most relevant data from the data harvested, and a named-entity recognition (NER) module for generating entities as relevant pre-defined structured data. In one embodiment, the measurement is based on internet sources, and the semantic similarity value changes dynamically over time. More specifically, semantic similarity is determined by the meaning of the keywords extracted from the search results and of their related keywords.

[0040] FIG. 1 illustrates a block diagram of a person profile finder system 100 in accordance with an embodiment of the present invention. The person profile finder system 100 comprises a data harvesting bot 110, a spiral keyword processor 120, a semantic similarity processor 130 and a named-entity recognition (NER) processor 140 to process an input 101 and to output the processed results 102. The input 101 is the user’s input, which includes any of the name, photo, email, phone number, etc. of the object of interest. The processed results 102 are the compiled search results for the object of interest. The data harvesting bot 110 is a data crawler, web crawler or web data crawler for systematically browsing the world wide web (WWW) to gather data in relation to the input 101. The spiral keyword processor 120 provides a first round of processing adapted to bypass the captcha or firewall while the bot 110 is crawling the data. The semantic similarity processor 130 processes the crawled data to extract the data most relevant to the input 101. The NER processor 140 processes the data and generates entities from the relevant data.

[0041] FIG. 2 exemplifies a screenshot for providing inputs in accordance with an embodiment of the present invention. The screenshot 200 provides fields for the users to input information regarding their object of interest. In this screenshot 200, there are name, email, phone and photo fields for users to fill in. In other embodiments, other fields may also be available for refining search results.

[0042] FIG. 3 illustrates a person profile search process in accordance with an embodiment of the present invention. The process 300 is carried out by the engine of a system 100 in FIG. 1. Briefly, the process 300 comprises the steps of receiving inputs from the user at step 305; harvesting data over the internet at step 310; spiral keywords and pages rotation at step 320; identifying data relevancy based on semantic similarity at step 330; and identifying keywords through NER processing at step 340.

[0043] Returning to the step 305, the system 100 acquires user input through a graphical user interface (GUI) comprising fields for user input. An example of the GUI is illustrated in FIG. 2. The user provides one or more items of input information in the fields through the GUI and, once done, triggers the person profile processing. At the step 310, the data harvesting bot 110 crawls the WWW to perform a search based on the supplied information. The search may be carried out through any search engine, such as Google, Yahoo, Bing and/or any other public or proprietary engines. As the search engine is performing the search, the spiral keywords and pages rotation method is carried out as and when necessary to bypass a captcha or firewall built on the source of the information found. Up to this point, the steps involve gathering the data relevant to the search input.

[0044] Referring now to the step 330, the system 100 performs the semantic similarity processing over the gathered data to extract only the data relevant to the object of interest, i.e. to filter out irrelevant data. At the step 340, the system further performs NER processing through the NER processor 140 to generate entities from the highly relevant data. The entities are compiled and output in a prescribed format to the user at the step 345.
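The flow of steps 305 to 345 can be pictured with the following minimal Python sketch. It is an illustration only and not the implementation of the system 100: the function names, the hard-coded harvested texts, the word-match relevance test and the hand-made gazetteer are all hypothetical stand-ins for the crawler, the semantic similarity processor 130 and the NER processor 140.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    subject: str
    entities: dict = field(default_factory=dict)


def harvest(keywords):
    # Stand-in for steps 310/320: a real bot would crawl the web here.
    return [
        "Anwar Ibrahim spoke in Kuala Lumpur",
        "Weather report for Tuesday",
    ]


def is_relevant(text, keywords):
    # Stand-in for step 330: keep text that shares a word with any keyword.
    words = set(text.lower().split())
    return any(part.lower() in words for kw in keywords for part in kw.split())


def extract_entities(texts, gazetteer):
    # Stand-in for step 340: look up known names in a hand-made gazetteer.
    found = {}
    for text in texts:
        for name, label in gazetteer.items():
            if name in text:
                found.setdefault(label, set()).add(name)
    return found


def build_profile(keywords, gazetteer):
    hits = harvest(keywords)                                  # steps 310/320
    relevant = [h for h in hits if is_relevant(h, keywords)]  # step 330
    entities = extract_entities(relevant, gazetteer)          # step 340
    return Profile(subject=keywords[0], entities=entities)    # step 345


profile = build_profile(
    ["Anwar Ibrahim"],
    {"Anwar Ibrahim": "person", "Kuala Lumpur": "location"},
)
```

In this sketch the weather text is dropped at the relevance stage, so only entities from the first text reach the structured output.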

[0045] FIG. 4 illustrates a schematic diagram of the web crawling in accordance with an embodiment of the present invention. The web crawling can be executed by the data harvesting bot 110 in FIG. 1. As provided earlier, the data harvesting bot 110 searches for the object of interest over the web. Over the WWW, the harvesting bot 110 searches through the web resources that contain the inputted information concerning the object of interest. The web resources include any of news, webpages, blogs, eBooks, videos, Instagram, images, LinkedIn, phone lookup, Facebook, Twitter, and many more. All the hits (results) are stored for further processing.

[0046] FIG. 5 illustrates a diagrammatic representation of a spiral keywords and pages rotation engine in accordance with an embodiment of the present invention. Briefly, the spiral keywords and pages rotation method is performed to avoid anti-bot mechanisms by briefly hopping from one targeted server to another targeted server, then back again, then hopping again, to extract all the required data. In one embodiment, the duration of each hop may change at random within a range of time set in a configuration file, for example 10-30 seconds. Such a random delay mechanism can effectively avoid most, if not all, of the anti-bot mechanisms or firewalls adapted to prevent data crawling. As shown in the diagram, it comprises a spiral centrally disposed on a pane of four segments parted by two perpendicularly crossed axes. In this embodiment, the four segments are Blog, Social Media, News and Webpages, respectively arranged in a clockwise manner. The spiral line indicates the sequence of hopping between the four segments, wherein each loop of the spiral indicates a cycle of the system hopping from Web > Blog > Social Media > News with the keywords.

[0047] As shown, a Keyword 1 is sent as a query to a website to crawl for data and, after a short stay on that website, the engine hops to a Blog page, then a social media page, then a news page, then back to a website, and so on. For each hop, the engine gathers some of the data found in relation to Keyword 1. Once Keyword 1 is completed, the system sends Keyword 2, and continues until all the keywords are exhausted.

[0048] FIG. 6 exemplifies a comparison of data captured and processed with a conventional method and with the spiral keyword processor 120. The table on the top exemplifies data processed through the conventional method, whereby any data crawler bot that stays too long in one category will be blocked by anti-bot measures such as a captcha or firewall. Basically, the conventional data harvesting bot searches based on the list of pages/sites found and crawls the pages sequentially. Typically, the bot stays on one particular page or site until it harvests all the required information in relation to the keyword in question. In this table, “Malaysia” is the keyword, and the bot takes 2 time-units to harvest the required data on Web-Google, then moves on to Blog-Blogspot with the same keyword “Malaysia”. Similarly, it takes 2 time-units on Blog-Blogspot before it moves on to Social Media-Facebook. On Social Media-Facebook it also stays for 2 time-units, and so on.

[0049] The time units spent on each page or site depend on the size of each page and of its associated pages, which include the linked pages found in the page in question. If there are many pages associated with the targeted page/site, the data harvesting bot 110 will stay on the targeted server hosting the targeted page for a considerable amount of time, which will trigger the anti-bot mechanism to block the data crawling. Usually, data crawling stops at that point.

[0050] The table below in FIG. 6 exemplifies data obtained by the data harvesting bot 110 through the spiral keyword processor 120. The data harvesting bot 110 crawls the list of targeted pages/sites spirally, such that the data harvesting bot 110 stays on each page/site for only 1 time unit, then on another for also 1 time unit, and so on. Through the hopping between web servers, the activity of the data harvesting bot 110 appears to each server to include some form of delay, which is how a real human behaves.
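The spiral visiting order described above can be sketched as a simple schedule generator. This is a minimal illustration under stated assumptions, not the engine of the incorporated Malaysian application: the function name `spiral_schedule`, the `visits_per_site` parameter and the site labels are hypothetical, the 10-30 second range is the example value given for the configuration file, and no network request is made.

```python
import random

# Hypothetical labels for the four segments of FIG. 5.
SITES = ["Web", "Blog", "Social Media", "News"]


def spiral_schedule(keywords, sites, visits_per_site, rng=None, delay_range=(10, 30)):
    """Yield (keyword, site, delay_seconds) hops in spiral order.

    Each keyword is carried around the full cycle of sites, one short stay
    per site, and the cycle repeats visits_per_site times before the next
    keyword is sent. The delay of each hop is drawn at random from
    delay_range.
    """
    rng = rng or random.Random()
    for keyword in keywords:
        for _ in range(visits_per_site):
            for site in sites:
                yield keyword, site, rng.uniform(*delay_range)


hops = list(spiral_schedule(["Malaysia", "Anwar"], SITES, 2, random.Random(0)))
```

With four sites in the cycle, three hops to other sites always fall between two consecutive visits to the same site, which corresponds to the gap of at least 3 time units per server noted for the lower table of FIG. 6.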

[0051] In the table, the data harvesting bot 110 sends “Malaysia” as the keyword to Web-Google and takes 1 time-unit to harvest any data harvestable within that time unit, then hops to Blog-Blogspot with the same keyword to harvest whatever it can in another time unit, then hops to Social Media-Facebook for another time unit and then to News-Bing for a further 1 time unit. When one cycle is complete, the data harvesting bot 110 returns to the same Web-Google page to continue where it left off and, once the 1 time-unit expires, it hops to Blog-Blogspot, and so on. As far as each particular server is concerned, there is a delay of at least 3 time units between visits to each site, leading the server to treat the visitor as not being a bot, thus allowing the bot to harvest as much data as possible.

[0052] The spiral keyword and page engine of the data harvesting bot 110 is disclosed in more detail in Malaysia Patent Application no. PI 2019005731 entitled “A system and method to prevent bot detection”, which is incorporated herein by reference.

[0053] FIG. 7 illustrates a matrix of objects for semantic similarity comparison in accordance with an embodiment of the present invention. All identified hits in relation to the searched object undergo a semantic similarity comparison through the matrix of objects shown. The matrix comprises the scores of the comparisons of the object pairs. The present matrix is adapted for text information of the objects found only. For non-text information, such as photos/images, multimedia resources, etc., a separate matrix may be required. Every matrix can be regarded as a system of intersections of ranges of variables to give a clear interpretation of the spatial relations in the matrix. In one embodiment, text information is compared against non-text information to render scores.

[0054] The following equation provides:

[0055] FIG. 8 shows a process for obtaining semantic information for an object in accordance with an embodiment of the present invention. The semantic information comprises the keywords obtained from the user inputs, and the related keywords generated based on those inputted keywords.

[0056] FIG. 9 illustrates a scoring process of semantic similarity value to establish data relevancy in accordance with an embodiment of the present invention. The process aims to compare two objects in the matrix and dynamically calculate the semantic similarity of the object pair. The dynamic calculation is required for processing web resources that change over time. The process comprises determining semantic information of the object pair in the matrix from the internet at step 902, calculating overlap semantic information at step 904, and deriving the semantic similarity score/value of the object pair for the matrix at step 906. The score/value is calculated and updated dynamically as the information is updated.

[0057] The above process is repeated until the values of all the object pairs in the matrix are computed.
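The loop over object pairs can be sketched in Python as follows. This is a minimal sketch under assumptions: the semantic information of each object (step 902) is taken to be an already-collected set of keywords, the function names are hypothetical, and the `shared_ratio` score is a placeholder chosen only to make the example runnable; it is not the formula of the present method.

```python
from itertools import combinations


def overlap(info_a, info_b):
    # Step 904: the keywords shared by both objects.
    return info_a & info_b


def shared_ratio(shared, info_a, info_b):
    # Placeholder score for step 906: shared keywords over all keywords.
    union = info_a | info_b
    return len(shared) / len(union) if union else 0.0


def fill_matrix(semantic_info, score=shared_ratio):
    """Return {(a, b): value} for every unordered pair of objects."""
    matrix = {}
    for a, b in combinations(sorted(semantic_info), 2):
        shared = overlap(semantic_info[a], semantic_info[b])
        matrix[(a, b)] = score(shared, semantic_info[a], semantic_info[b])
    return matrix


# Hypothetical keyword sets standing in for the output of step 902.
info = {"A": {"x", "y"}, "B": {"y", "z"}, "C": {"q"}}
matrix = fill_matrix(info)
```

Because web resources change over time, such a matrix would be recomputed whenever the keyword sets are refreshed.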

[0058] FIG. 10 illustrates a schematic diagram of the keyword semantic similarity determination 902 in accordance with one embodiment of the present invention. Each keyword is processed by a language detector. The language detector references the keyword against a multilingual related keywords corpus. The multilingual related keywords corpus comprises various internal/external databases, including an online encyclopedia such as Wikipedia, a lexical database such as Wordnet, dictionaries, thesauri, corpora, and other resources. Through the detection, a list of related keywords is generated.

[0059] FIG. 11 illustrates a process for detecting the language of a keyword in accordance with an embodiment of the present invention. The process comprises inserting keyword terms at step 1102, tokenizing the keyword terms at step 1104, comparing each token against a language service 1110 at step 1106, and tagging each keyword with its language at step 1108.
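Steps 1102 to 1108 can be sketched as below. The language service 1110 is not specified further in this description, so the sketch passes it in as a callable; the toy lexicon, the function names and the "unknown" fallback tag are hypothetical and serve only to make the example self-contained.

```python
import re

# Hypothetical stand-in for the language service 1110: a tiny lookup table.
TOY_LEXICON = {"prime": "en", "minister": "en", "perdana": "ms", "menteri": "ms"}


def toy_language_service(token):
    return TOY_LEXICON.get(token, "unknown")


def tokenize(term):
    # Step 1104: split the keyword term into lower-cased word tokens.
    return re.findall(r"\w+", term.lower())


def tag_languages(term, language_service):
    # Steps 1106 and 1108: query the service per token and tag the token.
    return {token: language_service(token) for token in tokenize(term)}


tags_ms = tag_languages("Perdana Menteri", toy_language_service)
tags_en = tag_languages("Prime Minister Anwar", toy_language_service)
```

A proper name such as "Anwar" is absent from the toy lexicon and is therefore tagged "unknown" in this sketch; how the actual service treats such tokens is not stated.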

[0060] FIG. 12 illustrates a process of generating related keywords in accordance with an embodiment of the present invention. The process comprises inserting tokenized keywords that are tagged with their corresponding language type at step 1202, querying the Wikipedia based on the detected language for each tokenized keyword at step 1204, querying a machine readable dictionary (MRD) and thesaurus based on the tokenized keyword at step 1206, and querying Wordnet based on the tokenized keyword at step 1208.

[0061] At step 1204, all hypertext linked words in the returned page are extracted and stored in a keyword repository 1220. Similarly, the synonym words retrieved from the MRD, thesaurus and Wordnet are stored in the keyword repository 1220.
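The related keyword generation of steps 1202 to 1208 can be sketched as follows. The three sources are passed in as callables because no particular client for Wikipedia, the MRD/thesaurus or Wordnet is named in this description; the stub lookups, the function name and the dictionary used as the keyword repository 1220 are assumptions made for illustration.

```python
def generate_related_keywords(tagged_tokens, encyclopedia, dictionary, lexical_db):
    """Return a repository mapping each token to its related keywords."""
    repository = {}
    for token, language in tagged_tokens.items():          # step 1202
        related = set()
        related.update(encyclopedia(token, language))      # step 1204: linked words
        related.update(dictionary(token))                  # step 1206: MRD/thesaurus
        related.update(lexical_db(token))                  # step 1208: Wordnet
        related.discard(token)  # do not list a keyword as related to itself
        repository[token] = related
    return repository


# Hypothetical stub sources returning fixed answers.
def stub_encyclopedia(token, language):
    return {"government", "cabinet"} if token == "minister" else set()


def stub_dictionary(token):
    return {"official"} if token == "minister" else set()


def stub_lexical_db(token):
    return {"minister", "government minister"} if token == "minister" else set()


repo = generate_related_keywords(
    {"minister": "en", "xyz": "unknown"},
    stub_encyclopedia,
    stub_dictionary,
    stub_lexical_db,
)
```

Using a set per keyword means that a word returned by more than one source is stored only once.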

[0062] FIG. 13 illustrates a calculation of overlapped semantic information in accordance with an embodiment of the present invention. In this embodiment, the calculation uses a Lesk Algorithm to identify the semantic similarity of Object A and Object B based on the equation below:

[0063] $$\text{Semantic Similarity} = \frac{\mathrm{SP1} + \mathrm{SP2}}{2}$$

[0064] where,

[0065] SP1 counts the overlaps of Object A and Object B;

[0066] SP2 is an average of fraction of overlapped keyword of Object A and Object B.
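A worked example of the equation, written as a short Python sketch, may help. The definitions of SP1 and SP2 leave room for interpretation, so the following is one possible reading and not a definitive one: SP1 is taken as the raw number of keywords shared by Object A and Object B, and SP2 as the mean of the shared fraction of each object's keywords. Under this reading the result is not confined to the range 0 to 1. The function name and the keyword sets are hypothetical.

```python
def semantic_similarity(keywords_a, keywords_b):
    a, b = set(keywords_a), set(keywords_b)
    shared = a & b
    # Assumed reading of SP1: raw count of overlapping keywords.
    sp1 = len(shared)
    # Assumed reading of SP2: average of the overlapped fraction per object.
    fraction_a = len(shared) / len(a) if a else 0.0
    fraction_b = len(shared) / len(b) if b else 0.0
    sp2 = (fraction_a + fraction_b) / 2
    return (sp1 + sp2) / 2


# Hypothetical keyword sets for two objects.
object_a = {"politician", "malaysia", "leader", "reformasi"}
object_b = {"politician", "malaysia", "leader", "doctor", "langkawi"}
value = semantic_similarity(object_a, object_b)
```

Here three keywords overlap, so SP1 = 3, SP2 = (3/4 + 3/5) / 2 = 0.675, and the value is (3 + 0.675) / 2 = 1.8375.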

[0067] FIG. 14 exemplifies search results obtained by the present system and method in accordance with the above embodiments of the present invention. The top half of the figure is an article found by the data crawler, in which the keywords of the article are categorized as date, location, organization and person. The bottom half of the figure is a table listing the various categorized information about a subject of interest, i.e. “Anwar Ibrahim”, found on the page above.

[0068] While specific embodiments have been described and illustrated, it is understood that many changes, modifications, variations and combinations thereof could be made to the present invention without departing from the scope of the invention.

Claims

1. A method for profiling an object based on search input, the method comprising: receiving (305) the search input of the object to be profiled, wherein the input include keywords; harvesting (310) data from internet through a data harvesting bot (110) by crawling through pages/sites of the internet; rotating (320) the keywords and the pages/sites through a spiral keyword processor (120) for hopping between targeted pages/sites for avoiding anti-bot mechanisms on the targeted pages/sites; identifying (330) data relevancy based on semantic similarity of the keywords to get most relevant data from the harvested data; identifying (340) the keywords through named entity recognition, NER, processor (140) to extract most relevant data; and outputting (345) the profile of the object in a structured manner with highly relevant data.

2. The method according to claim 1, wherein the identifying (330) data relevancy further comprising: determining (902) semantic information of an object pair in a matrix from the internet; calculating (904) overlap semantic information; and outputting (906) semantic similarity value on the matrix.

3. The method according to claim 2, further comprising: tokenizing (1104) keywords; comparing (1106) each token against language detect service (1110); and tagging (1108) each keyword with its language.

4. The method according to claim 2, further comprising: querying (1202) online encyclopedia based on the detected language for each tokenized keyword; querying (1204) machine readable dictionary, MRD, and thesaurus based on the tokenized keyword; querying (1206) lexical database based on the tokenized keyword; and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from MRD, thesaurus and lexical database on a keyword repository (1220).

5. The method according to claim 1, wherein the semantic similarity is derived based on Lesk Algorithm.

6. An object profiling system for profiling an object based on search input, said system comprising: a graphical user interface, GUI, for receiving input of the object to be profiled, wherein the input includes keywords; a data harvesting bot (110) for harvesting data from internet by crawling through pages/sites of the internet; a spiral keyword processor (120), operationally hopping between targeted pages/sites for avoiding anti-bot mechanism on the targeted pages/sites; a semantic similarity processor (130) for establishing data relevancy to get most relevant data from the data harvested; a named-entity recognition, NER, processor (140) for classifying the keywords to extract most relevant data; and an output of the profile of the object in a structured manner with the most relevant data.

7. The object profiling system according to claim 6, wherein the semantic similarity processor (130) is adapted to operationally determine semantic information of an object pair in a matrix from the internet, calculating overlap semantic information and outputting semantic similarity value on the matrix.

8. The object profiling system according to claim 7, wherein the semantic similarity processor (130) is adapted for tokenizing keywords, comparing each token against language detect service, and tagging each keyword with its detected language.

9. The object profiling system according to claim 7, wherein the semantic similarity processor (130) is adapted for querying the online encyclopedia on detected language for each tokenized keyword, querying machine readable dictionary, MRD, thesaurus and lexical database based on tokenized keyword, and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from MRD, thesaurus and lexical database on a keyword repository (1220).

10. The object profiling system according to claim 7, wherein the semantic similarity is derived based on a Lesk Algorithm.