WO2022025750A1 - Person profile finder using semantic similarity measurement of object based on internet source and related keywords - Google Patents

Info

Publication number
WO2022025750A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
keyword
keywords
semantic similarity
processor
Prior art date
Application number
PCT/MY2020/050167
Other languages
French (fr)
Inventor
Amru Yusrin AMRUDDIN
May Fern KOH
Nardiatul Kasmi MOHAMED KASSIM
Muhammad Awis Jamaluddin JOHARI
Wooi Kin Goon
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2022025750A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present invention relates to personal profile forming.
  • the present invention relates to a system and method for compiling personal profile information using semantic similarity measurement of object based on internet source and related keywords.
  • Platform-dependent search engines only search the information about a person that was provided either by the person themselves or by another person, i.e. a single source.
  • When the search (of a person) is conducted on the open web, the search engine returns all the hits for the input search terms as individual links. The user is required to filter the results manually.
  • US Patent Publication no. US2014/0115053A1 discloses a method of finding members of the social media network having common interests and may find use in such social media as common interests, location, profession, age, rank, player skill, etc. The users are only exposed to each other when certain conditions are met.
  • This system is a closed network, whereby the person search is based only on the members of the network. The information regarding the person is as good as what was provided by the members of the network.
  • Canadian Patent Publication no. CA2437456A1 offers a matchmaking system that searches for new matches. Similarly, the system provides searches over closed networks. The search results are presented in the way they were provided by the owner of the profile.
  • European Patent No. EP159631B1 discloses a search engine that searches the Internet (open web). After a user submits a search request (also referred to as a "query") that includes search terms, the search engine service identifies web pages that may be related to those search terms. The search engine service then displays to the user the links to those web pages in an order that is based on their relevance. The user has to browse through the search results and filter the information manually.
  • the method further comprises tokenizing keywords; comparing each token against language service; and tagging each keyword with its language.
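  • the method further comprises querying Wikipedia based on the detected language for each tokenized keyword; querying a machine readable dictionary, MRD, and thesaurus based on the tokenized keyword; querying Wordnet based on the tokenized keyword; and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from the MRD, thesaurus and Wordnet in a keyword repository.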
  • an object profiling system for profiling an object based on search input.
  • the system comprises a GUI for receiving input of the object to be profiled, the input including keywords; a data harvesting bot for harvesting data from the internet; a spiral keyword and page rotation engine, operationally hopping between targeted pages/sites for avoiding anti-bot mechanisms on the targeted pages/sites; a semantic similarity module for establishing data relevancy to obtain the most relevant data from the harvested data; a named-entity recognition, NER, module for classifying the keywords to extract the most relevant data; and an output of the profile of the object in a structured manner with the most relevant data.
  • the semantic similarity module is adapted to operationally determine semantic information of an object pair in a matrix from the internet, calculating overlap semantic information and outputting semantic similarity value on the matrix.
  • the semantic similarity module is adapted for tokenizing keywords; comparing each token against a language service; and tagging each keyword with its detected language.
  • the semantic similarity module is adapted for querying Wikipedia in the detected language for each tokenized keyword, and a machine readable dictionary, MRD, thesaurus and Wordnet based on the tokenized keyword, and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from the MRD, thesaurus and Wordnet in a keyword repository.
  • FIG. 1 illustrates a block diagram of a person profile finder engine 100 in accordance with an embodiment of the present invention
  • FIG. 2 exemplifies a screenshot for providing inputs in accordance with an embodiment of the present invention
  • FIG. 3 illustrates a person profile search process in accordance with an embodiment of the present invention
  • FIG. 4 illustrates a schematic diagram of the web crawling in accordance with an embodiment of the present invention
  • FIG. 5 illustrates diagrammatic representation of a spiral keywords and pages rotation engine in accordance with an embodiment of the present invention
  • FIG. 7 illustrates a matrix of objects for semantic similarity comparison in accordance with an embodiment of the present invention
  • FIG. 11 illustrates a process for detecting language of a keyword in accordance with an embodiment of the present invention
  • FIG. 12 illustrates a process of generating related keyword in accordance with an embodiment of the present invention
  • FIG. 13 illustrates a calculation of overlapped semantic information in accordance with an embodiment of the present invention
  • FIG. 14 exemplifies search results obtained by the present system and method in accordance with the above embodiments of the present invention.
  • Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware, software, firmware, and/or by human operators.
  • FIG. 1 A block diagram illustrating an exemplary computing environment in accordance with various methods described herein.
  • FIG. 1 A block diagram illustrating an exemplary computing device.
  • Cloud computing is internet-based computing that provides shared processing resources and data to computers and other devices on demand.
  • cloud computing provides access to resources such as networks, servers, storage, applications and services. These resources can be rapidly provisioned and released with minimal management effort.
  • the embodiments of the present invention enable a user to search for a person profile in a single easy-to-use user interface and return a list of profile information processed from internet sources.
  • the results are derived based on measurements of the hits found from internet sources and semantic similarity value that changes dynamically from time to time.
  • the semantic similarity is determined by the meaning of the keywords extracted from search results and their related keywords, such that the difference between two objects is calculated from the likeness of their meaning, as opposed to their syntactical or visual representation (e.g. the string format, the shape of the object, etc.).
  • the similarities between Anwar and Dr. Mahathir include that they were in the same coalition party, both were great leaders in Malaysia, and both were politicians and ex-UMNO members; but they differ in that Anwar had never been Prime Minister and had been jailed, while Dr. Mahathir had been Prime Minister twice and had never been jailed.
  • NER named entity recognition
  • the system includes a data harvesting crawler for gathering data from the internet, a spiral keywords and pages rotation processor adapted for bypassing captcha or any anti-bot means during data harvesting, a semantic similarity module for identifying data relevancy and obtaining the most relevant data from the harvested data, and a named-entity recognition (NER) module for generating entities as relevant pre-defined structured data.
  • the measurement is based on Internet source, and semantic similarity value will change dynamically over time. More specifically, semantic similarity is determined by meaning of keyword extracted from search results and their related keywords.
  • FIG. 1 illustrates a block diagram of a person profile finder system 100 in accordance with an embodiment of the present invention.
  • the person profile finder system 100 comprises a data harvesting bot 110, a spiral keyword processor 120, a semantic similarity processor 130 and a named-entity recognition (NER) processor 140 to process an input 101 and to output the processed results 102.
  • the input 101 is the user’s input, which includes any of the name, photo, email, phone number, etc. of the object of interest.
  • the processed results 102 are the compiled search results for the object of interest.
  • the data harvesting bot 110 is a data crawler or web crawler or web data crawler for systematically browsing the world wide web (WWW) to gather data in relation to the input 101.
  • WWW world wide web
  • FIG. 2 exemplifies a screenshot for providing inputs in accordance with an embodiment of the present invention.
  • the screenshot 200 provides fields for users to input information regarding their object of interest. In this screenshot 200, there are name, email, phone and photo fields for users to fill in. In other embodiments, other fields may also be available for refining search results.
  • FIG. 3 illustrates a person profile search process in accordance with an embodiment of the present invention.
  • the process 300 is carried out by the engine of a system 100 in FIG. 1. Briefly, the process 300 comprises the steps of receiving inputs from user at step 305; harvesting data over the internet at step 310; spiral keywords and pages rotation at step 320; identifying data relevancy based on semantic similarity at step
  • the system 100 acquires user input through a graphical user interface (GUI) comprising fields for user input.
  • GUI graphical user interface
  • Example of the GUI is illustrated in FIG. 2.
  • The user provides one or more items of input information in the fields through the GUI and, once done, triggers the person profile processing.
  • the data harvesting bot 110 crawls the WWW to perform a search based on the supplied information.
  • the search may be carried out through any search engine, such as Google, Yahoo, Bing and/or any other public or proprietary engines.
  • while the search engine is performing the search, the spiral keywords and pages rotation method is carried out as and when necessary to bypass a captcha or firewall built on the source of the information found. Up to this point, the steps involve gathering the data relevant to the search input.
  • the system 100 performs the semantic similarity processing over the gathered data to extract only the relevant data in relation to the object of interest, i.e. filter-out irrelevant data.
  • the system further performs NER processing through an NER processor 140 to generate entities from the highly relevant data.
  • the entities are compiled and output in a prescribed format to the user at the step 345.
  • FIG. 4 illustrates a schematic diagram of the web crawling in accordance with an embodiment of the present invention.
  • the web crawling can be executed by the data harvesting bot 110 in FIG. 1.
  • the data harvesting bot 110 searches for the object of interest over the web. Over the WWW, the harvesting bot 110 searches through the web resources that contain the inputted information concerning the object of interest.
  • the web resources include any of news, webpages, blogs, eBooks, videos, Instagram, images, LinkedIn, phone lookup, Facebook, Twitter, and many more. All the hits (results) are stored for further processing.
  • FIG. 5 illustrates diagrammatic representation of a spiral keywords and pages rotation engine in accordance with an embodiment of the present invention.
  • the spiral keywords and page rotation method is performed to avoid anti-bot mechanisms by briefly hopping from one targeted server to another targeted server, then back again, then hopping again, until all the required data is extracted.
  • the duration of each hop may change at random within a range of time set in a configuration file, for example 10-30 seconds.
  • Such a random delay mechanism can effectively avoid most, if not all, of the anti-bot mechanisms or firewalls adapted to prevent data crawling.
  • it comprises a spiral centrally disposed on a pane of four segments parted by two perpendicularly crossed axes.
  • the four segments are blog, social media, news and webpages, respectively arranged in a clockwise manner.
  • the spiral line indicates the sequence of hopping between the four segments, wherein each loop of the spiral indicates a cycle of the system hopping from Web > Blog > Social Media > News with the keywords.
  • Keyword 1 is sent as a query to a website to crawl for data and, after a short stay on that website, the engine hops to a blog page, then a social media page, then a news page, then back to a website, and so on. For each hop, the engine gathers the data found in relation to Keyword 1. Once Keyword 1 is completed, the system sends Keyword 2, and so on until all the keywords are exhausted.
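The hop order described for FIG. 5 can be sketched as a plain schedule generator. This is only an illustrative sketch: the function name `spiral_schedule`, the fixed `loops_per_keyword` count and the use of `random.Random` are assumptions, not part of the disclosure, and the code produces an ordering only (it performs no crawling).

```python
import random
from typing import Iterator, List, Optional, Tuple

# Segment order taken from the FIG. 5 description: Web > Blog > Social Media > News.
SEGMENTS = ["Web", "Blog", "Social Media", "News"]


def spiral_schedule(
    keywords: List[str],
    loops_per_keyword: int,
    dwell_range: Tuple[float, float] = (10.0, 30.0),
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (keyword, segment, dwell_seconds) in spiral order.

    Hypothetical helper: a fixed number of loops per keyword is an
    assumption; the text only says hopping continues until the data
    for a keyword has been gathered.
    """
    rng = rng or random.Random()
    low, high = dwell_range
    for keyword in keywords:                 # Keyword 1 is exhausted before Keyword 2
        for _ in range(loops_per_keyword):   # one loop of the spiral = one full cycle
            for segment in SEGMENTS:
                # Hop duration drawn at random from the configured range.
                yield keyword, segment, rng.uniform(low, high)
```

For two keywords and two loops each, the generator yields 16 hops: eight for the first keyword (two full Web, Blog, Social Media, News cycles), then eight for the second.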
  • FIG. 6 exemplifies a comparison of data captured and processed with a conventional method and spiral keyword processor 120.
  • the table on the top exemplifies data processed through the conventional method, whereby any data crawler bot that stays too long in one category will as a result be blocked by anti-bot ware such as a captcha or firewall.
  • the conventional data harvesting bot searches based on the list of pages/sites found and crawls the pages sequentially. Typically, the bot stays on one particular page or site until it harvests all the required information in relation to the keyword in question.
  • “Malaysia” is the keyword, and it takes 2 time-units to harvest the required data on Web-Google before the bot moves on to Blog-Blogspot with the same keyword “Malaysia”. Similarly, it takes 2 time-units on Blog-Blogspot before it moves on to Social Media-Facebook. On Social Media-Facebook it also stays for 2 time-units, and so on.
  • the time units spent on each page or site depend on the size of each page and its associated pages, which include linked pages found in the page in question. If there are many pages associated with the targeted page/site, the data harvesting bot 110 will stay on the server hosting the targeted page for a considerable amount of time, which will trigger the anti-bot mechanism to block the data crawling. Usually, data crawling stops at that point.
  • the table at the bottom of FIG. 6 exemplifies data obtained by the data harvesting bot 110 through the spiral keyword processor 120. The data harvesting bot 110 crawls the list of targeted pages/sites spirally, such that the data harvesting bot 110 stays on each page/site for only 1 time-unit, then on another also for 1 time-unit, and so on. Through the hopping between web servers, the data harvesting bot 110 is treated as having some form of delay in its activity, which is how a real human behaves.
  • the data harvesting bot 110 sends “Malaysia” as the keyword to Web-Google and takes 1 time-unit to harvest any data harvestable within that time-unit, then hops to Blog-Blogspot with the same keyword to harvest whatever it can in another time-unit, then hops to Social Media-Facebook for another time-unit, and then to News-Bing for a further 1 time-unit.
  • the data harvesting bot 110 returns to the same Web-Google page to continue where it left off, and once the 1 time-unit expires, it hops to Blog-Blogspot, and so on.
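The contrast drawn for FIG. 6 can be illustrated with a small simulation of visit lengths. This is a hedged sketch under stated assumptions: the function names are hypothetical, "work" is an abstract count of time-units per site, and the 2 time-units assumed for News-Bing is extrapolated from "and so on" rather than stated in the text.

```python
from collections import deque
from typing import Dict, List, Tuple


def sequential_visits(work: Dict[str, int]) -> List[Tuple[str, int]]:
    """Conventional crawl: stay on each site until all of its work is done."""
    return [(site, units) for site, units in work.items() if units > 0]


def spiral_visits(work: Dict[str, int], slice_units: int = 1) -> List[Tuple[str, int]]:
    """Spiral crawl: spend at most `slice_units` per visit, then hop and resume later."""
    queue = deque((site, units) for site, units in work.items() if units > 0)
    visits: List[Tuple[str, int]] = []
    while queue:
        site, remaining = queue.popleft()
        spent = min(slice_units, remaining)
        visits.append((site, spent))
        if remaining > spent:
            # Re-queue the site so the crawl continues where it left off.
            queue.append((site, remaining - spent))
    return visits


# Time-units per site; the News-Bing value is an assumption (see lead-in).
work = {
    "Web-Google": 2,
    "Blog-Blogspot": 2,
    "Social Media-Facebook": 2,
    "News-Bing": 2,
}
```

With this input both strategies spend 8 time-units in total, but the longest single stay is 2 time-units for `sequential_visits(work)` and 1 time-unit for `spiral_visits(work)`. Note that in this simplified model, once only one site has work left, consecutive visits land on the same site.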
  • FIG. 7 illustrates a matrix of objects for semantic similarity comparison in accordance with an embodiment of the present invention. All identified hits in relation to the searched object undergo a semantic similarity comparison through the matrix of objects shown. The matrix comprises the scores of the comparisons of the object pairs.
  • the present matrix is adapted only for text information of the objects found.
  • for non-text information such as photos/images, multimedia resources, etc., a separate matrix may be required. Every matrix can be regarded as a system of intersections of ranges of variables to give a clear interpretation of the spatial relations in the matrix.
  • text information is compared against non-text information to render scores.
  • FIG. 8 shows a process for obtaining semantic information for object in accordance with an embodiment of the present invention.
  • the semantic information comprises the keywords obtained from the user inputs, and related keywords generated based on those inputted keywords.
  • FIG. 9 illustrates a scoring process of semantic similarity value to establish data relevancy in accordance with an embodiment of the present invention.
  • the process aims to compare two objects in the matrix and dynamically calculate the semantic similarity of the object pairs.
  • the dynamic calculation is required for processing web resources that change in time.
  • the process comprises determining semantic information of the object pair in the matrix from the internet at step 902, calculating overlap semantic information at step 904, and deriving semantic similarity score/value of the object pair for the matrix at step 906.
  • the score/value is calculated and updated dynamically as the information is updated.
  • the above process is repeated until all the object pairs’ values in the matrix are computed.
  • FIG. 10 illustrates schematic diagram of the keyword semantic similarity determination 902 in accordance with one embodiment of the present invention.
  • Each keyword is processed by a language detector.
  • the language detector references the keyword against a multilingual related keywords corpus.
  • the multilingual related keywords corpus comprises various internal/external databases, including online encyclopedias such as Wikipedia, lexical databases such as Wordnet, dictionaries, thesauri, corpora, and other resources. Through the detection, a list of related keywords is generated.
  • FIG. 11 illustrates a process for detecting language of a keyword in accordance with an embodiment of the present invention.
  • the process comprises inserting keyword terms at step 1102, tokenizing the keyword terms at step 1104, comparing each token against the language service 1110 at step 1106, and tagging each keyword with its language at step 1108.
  • FIG. 12 illustrates a process of generating related keyword in accordance with an embodiment of the present invention.
  • the process comprises inserting tokenized keyword that tagged with its corresponding language type at step 1202, querying the Wikipedia based on the detected language for each tokenized keyword at step 1204, querying machine readable dictionary (MRD) and thesaurus based on the tokenized keyword at step 1206, and querying Wordnet based on the tokenized keyword at step
  • MRD machine readable dictionary
  • step 1204 wherein all hypertext linked words in the returned page are grabbed and stored in a keyword repository 1220, and similarly the synonym words retrieved from the MRD thesaurus and Wordnet are stored on the keyword repository
  • FIG. 13 illustrates a calculation of overlapped semantic information in accordance with an embodiment of the present invention.
  • the calculation uses a Lesk Algorithm to identify the semantic similarity of Object A and Object B based on the equation below:
  • SP1 counts the overlaps of Object A and Object B
  • SP2 is an average of fraction of overlapped keyword of Object A and Object B.
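The equation itself is not reproduced in this text, so the two terms can only be sketched from their one-line descriptions. The sketch below assumes that each object is represented as a set of keywords, that SP1 is the size of the intersection, and that SP2 averages the overlapped fraction of each object; the function name `overlap_terms` is hypothetical, and how SP1 and SP2 combine into the final similarity value is not stated and is left out.

```python
from typing import Set, Tuple


def overlap_terms(a: Set[str], b: Set[str]) -> Tuple[int, float]:
    """Return (SP1, SP2) for two keyword sets, under the assumptions above."""
    overlap = a & b
    sp1 = len(overlap)                        # SP1: count of overlapping keywords
    if not a or not b:
        return sp1, 0.0                       # avoid division by zero for empty objects
    sp2 = (sp1 / len(a) + sp1 / len(b)) / 2   # SP2: mean overlapped fraction of A and B
    return sp1, sp2


object_a = {"politician", "malaysia", "umno", "minister"}
object_b = {"politician", "malaysia"}
```

Here `overlap_terms(object_a, object_b)` gives SP1 = 2 and SP2 = (2/4 + 2/2) / 2 = 0.75.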
  • FIG. 14 exemplifies search results obtained by the present system and method in accordance with the above embodiments of the present invention.
  • the top half of the figure is an article found by the data crawler, whereby keywords of the article are being categorized as date, location, organization and person.
  • the bottom half of the figure is a table listing out various categorized information of a subject of interest, i.e. “Anwaryama” found on that page above.
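The structured output described for FIG. 14 can be sketched as a grouping of recognised entities by category. This is a minimal sketch: it assumes an upstream NER step has already produced `(text, label)` pairs, and `build_profile` and the placeholder entities are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Categories named in the FIG. 14 description.
CATEGORIES = ("date", "location", "organization", "person")


def build_profile(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs into one list per category, dropping duplicates."""
    profile: Dict[str, List[str]] = {category: [] for category in CATEGORIES}
    for text, label in entities:
        # Labels outside the four categories are ignored in this sketch.
        if label in profile and text not in profile[label]:
            profile[label].append(text)
    return profile


# Placeholder entities for illustration only.
entities = [
    ("Person A", "person"),
    ("2020-01-01", "date"),
    ("Kuala Lumpur", "location"),
    ("Example Org", "organization"),
    ("Kuala Lumpur", "location"),
    ("misc", "other"),
]
```

Calling `build_profile(entities)` returns one list per category, with the repeated location kept once and the unlabelled item dropped.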

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method for profiling an object based on search input. The method comprises receiving (305) the search input of the object to be profiled, the input including keywords; harvesting (310) data from the internet through a data harvesting bot (110); rotating keywords and pages through a spiral keyword processor (120) for hopping between targeted pages/sites for avoiding anti-bot mechanisms on the targeted pages/sites; identifying (330) data relevancy based on semantic similarity of the keywords to get the most relevant data from the harvested data; identifying (340) the keywords through a named entity recognition, NER, processor (140) to extract the most relevant data; and outputting (345) the profile of the object in a structured manner with highly relevant data. An object profiling system is also provided.

Description

PERSON PROFILE FINDER USING SEMANTIC SIMILARITY MEASUREMENT OF OBJECT BASED ON INTERNET SOURCE AND RELATED KEYWORDS
Field of the Invention
[0001] The present invention relates to personal profile forming. In particular, the present invention relates to a system and method for compiling personal profile information using semantic similarity measurement of object based on internet source and related keywords.
Background
[0002] In 2019, an estimated 2.95 billion people were using social media around the world (published by J. Clement, Apr 1, 2020, on https://www.statista.com). Given any random person's name, it is no surprise that a search returns a very large number of hits, unless the name is unique.
[0003] Platform-dependent search engines only search the information about a person that was provided either by the person themselves or by another person, i.e. a single source. When the search (of a person) is conducted on the open web, the search engine returns all the hits for the input search terms as individual links. The user is required to filter the results manually.
[0004] US Patent Publication no. US2014/0115053A1 discloses a method of finding members of the social media network having common interests and may find use in such social media as common interests, location, profession, age, rank, player skill, etc. The users are only exposed to each other when certain conditions are met. This system is a closed network, whereby the person search is based only on the members of the network. The information regarding the person is only as good as what was provided by the members of the network.
[0005] Canadian Patent Publication no. CA2437456A1 offers a matchmaking system that searches for new matches. Similarly, the system provides searches over closed networks. The search results are presented in the way they were provided by the owner of the profile.
[0006] European Patent No. EP159631B1 discloses a search engine that searches the Internet (open web). After a user submits a search request (also referred to as a "query") that includes search terms, the search engine service identifies web pages that may be related to those search terms. The search engine service then displays to the user the links to those web pages in an order that is based on their relevance. The user has to browse through the search results and filter the information manually.
[0007] There exists a need for a system and method that facilitate the search of a person over the open web and that collect and compile the information relating to the person of interest.
Summary
[0008] In one aspect of the present invention, there is provided a method for profiling an object based on search input. The method comprises receiving the search input of the object to be profiled, the input including keywords; harvesting data from the internet through a data harvesting bot; rotating keywords and pages through a spiral keyword and page rotation engine for hopping between targeted pages/sites for avoiding anti-bot mechanisms on the targeted pages/sites; establishing data relevancy based on semantic similarity of the keywords to get the most relevant data from the harvested data; classifying the keywords through a named entity recognition, NER, module to extract the most relevant data; and outputting the profile of the object in a structured manner with highly relevant data.
[0009] In one embodiment, the step of establishing data relevancy further comprises determining semantic information of an object pair in a matrix from the internet; calculating overlap semantic information; and outputting a semantic similarity value on the matrix.
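The matrix of object pairs can be sketched as follows. This is an illustrative sketch only: the scoring function is pluggable, and the Jaccard index used as the default is a stand-in chosen for brevity, not the overlap calculation of the invention. The names `similarity_matrix` and `jaccard`, and the value 1.0 on the diagonal, are assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Placeholder overlap score (an assumption, not the patent's formula)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(
    semantic_info: Dict[str, Set[str]],
    score: Callable[[Set[str], Set[str]], float] = jaccard,
) -> Dict[str, Dict[str, float]]:
    """Compute a symmetric matrix of pairwise scores over all objects."""
    names: List[str] = list(semantic_info)
    # Start with 1.0 on the diagonal (assumed) and 0.0 elsewhere.
    matrix = {n: {m: 1.0 if n == m else 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):      # every object pair exactly once
        value = score(semantic_info[a], semantic_info[b])
        matrix[a][b] = matrix[b][a] = value
    return matrix


info = {"A": {"x", "y", "z"}, "B": {"y", "z", "w"}, "C": {"q"}}
```

Because web resources change over time, a caller would recompute the matrix whenever the semantic information of an object is refreshed; with the sample `info`, objects A and B share two of four distinct keywords and score 0.5.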
[0010] In another embodiment, the method further comprises tokenizing keywords; comparing each token against language service; and tagging each keyword with its language.
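The tokenize, compare and tag steps can be sketched with an in-memory stand-in for the language service. Everything named here is an assumption for illustration: `LANGUAGE_SERVICE` is a tiny hypothetical word list, whereas a real language service would be an external component.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for the "language service": a small per-language word list.
LANGUAGE_SERVICE: Dict[str, Set[str]] = {
    "en": {"minister", "prime", "leader"},
    "ms": {"menteri", "perdana", "pemimpin"},
}


def tokenize(keyword_terms: str) -> List[str]:
    """Split the input into lower-cased word tokens."""
    return re.findall(r"\w+", keyword_terms.lower())


def tag_languages(keyword_terms: str) -> List[Tuple[str, str]]:
    """Tag each token with the first language whose word list contains it."""
    tagged = []
    for token in tokenize(keyword_terms):
        language = next(
            (lang for lang, words in LANGUAGE_SERVICE.items() if token in words),
            "unknown",  # token not found in any language list
        )
        tagged.append((token, language))
    return tagged
```

For the input "Perdana Menteri leader", the first two tokens are tagged "ms" and the third "en"; tokens absent from every list are tagged "unknown".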
[0011] In yet another embodiment, the method further comprises querying Wikipedia based on the detected language for each tokenized keyword; querying a machine readable dictionary, MRD, and thesaurus based on the tokenized keyword; querying Wordnet based on the tokenized keyword; and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from the MRD, thesaurus and Wordnet in a keyword repository.
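The related-keyword generation can be sketched with offline stand-ins for the three sources. The dictionaries below are hypothetical sample data, not real query results, and `build_keyword_repository` is an assumed name; a real system would query Wikipedia, an MRD/thesaurus and Wordnet instead of reading from these tables.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical offline stand-ins for the external sources named in the text.
WIKI_LINKS: Dict[Tuple[str, str], Set[str]] = {
    ("en", "malaysia"): {"kuala lumpur", "southeast asia"},
}
MRD_THESAURUS: Dict[str, Set[str]] = {"leader": {"chief", "head"}}
WORDNET: Dict[str, Set[str]] = {"leader": {"head", "guide"}}


def build_keyword_repository(tagged: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect related keywords for each (token, language) pair."""
    repository: Dict[str, Set[str]] = {}
    for token, language in tagged:
        related: Set[str] = set()
        # Hyperlinked words from the page in the detected language.
        related |= WIKI_LINKS.get((language, token), set())
        # Synonyms from the dictionary/thesaurus and from Wordnet.
        related |= MRD_THESAURUS.get(token, set())
        related |= WORDNET.get(token, set())
        repository[token] = related
    return repository
```

Using sets means a synonym returned by more than one source (here "head") is stored only once per keyword.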
[0012] In another aspect of the present invention, there is provided an object profiling system for profiling an object based on search input. The system comprises a GUI for receiving input of the object to be profiled, the input including keywords; a data harvesting bot for harvesting data from the internet; a spiral keyword and page rotation engine, operationally hopping between targeted pages/sites for avoiding anti-bot mechanisms on the targeted pages/sites; a semantic similarity module for establishing data relevancy to obtain the most relevant data from the harvested data; a named-entity recognition, NER, module for classifying the keywords to extract the most relevant data; and an output of the profile of the object in a structured manner with the most relevant data.
[0013] In one embodiment, the semantic similarity module is adapted to operationally determine semantic information of an object pair in a matrix from the internet, calculate overlap semantic information and output a semantic similarity value on the matrix.
[0014] In another embodiment, the semantic similarity module is adapted for tokenizing keywords; comparing each token against a language service; and tagging each keyword with its detected language.
[0015] In a further embodiment, the semantic similarity module is adapted for querying Wikipedia in the detected language for each tokenized keyword, and a machine readable dictionary, MRD, thesaurus and Wordnet based on the tokenized keyword, and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from the MRD, thesaurus and Wordnet in a keyword repository.
Brief Description of the Drawings
[0016] This invention will be described by way of non-limiting embodiments of the present invention, with reference to the accompanying drawings, in which:
[0017] FIG. 1 illustrates a block diagram of a person profile finder engine 100 in accordance with an embodiment of the present invention;
[0018] FIG. 2 exemplifies a screenshot for providing inputs in accordance with an embodiment of the present invention;
[0019] FIG. 3 illustrates a person profile search process in accordance with an embodiment of the present invention;
[0020] FIG. 4 illustrates a schematic diagram of the web crawling in accordance with an embodiment of the present invention;
[0021] FIG. 5 illustrates a diagrammatic representation of a spiral keywords and pages rotation engine in accordance with an embodiment of the present invention;
[0022] FIG. 6 exemplifies a comparison of data captured and processed with a conventional method and with the spiral keyword and page rotation engine;
[0023] FIG. 7 illustrates a matrix of objects for semantic similarity comparison in accordance with an embodiment of the present invention;
[0024] FIG. 8 shows a process for obtaining semantic information for an object in accordance with an embodiment of the present invention;
[0025] FIG. 9 illustrates a scoring process of semantic similarity value in accordance with an embodiment of the present invention;
[0026] FIG. 10 illustrates a schematic diagram of the keyword semantic similarity generation in accordance with one embodiment of the present invention;
[0027] FIG. 11 illustrates a process for detecting the language of a keyword in accordance with an embodiment of the present invention;
[0028] FIG. 12 illustrates a process of generating related keywords in accordance with an embodiment of the present invention;
[0029] FIG. 13 illustrates a calculation of overlapped semantic information in accordance with an embodiment of the present invention; and
[0030] FIG. 14 exemplifies search results obtained by the present system and method in accordance with the above embodiments of the present invention.
Detailed Description
[0031] In line with the above summary, the following description of a number of specific and alternative embodiments is provided to understand the inventive features of the present invention. It shall be apparent to one skilled in the art, however, that this invention may be practiced without such specific details. Some of the details may not be described at length so as not to obscure the invention. For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures.
[0032] Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps may be performed by a combination of hardware, software, firmware, and/or by human operators.
[0033] Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable storage medium tangibly embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, data storage drives be it magnetic or optical disks, and/or semiconductor-based memories, such as RAMs, ROMs, flash memory, etc., for storing digital instructions executable by any processing devices.
[0034] Various methods described herein may be practiced by combining one or more machine-readable storage media containing the code according to the present invention with appropriate standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present invention may involve one or more computers (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps of the invention could be accomplished by modules, routines, subroutines, or subparts of a computer program product.
[0035] The present invention is implementable on both local and remote computing devices. A local computer includes a personal computer or a handheld mobile device. A remote computer may be a remote server or cloud computing device, which includes a virtual machine, which generally refers to a self-contained operating environment that behaves as if it is a separate computer even though it is part of a separate computer or may be virtualized using resources from multiple computers.
[0036] In one embodiment, software applications can be developed based on available platforms or platform products to build new functionality as an extension to the existing functionality. For example, a platform can be provided by a third party, e.g. a cloud computing platform that allows provisioning of the software application in a cloud environment.
[0037] Cloud computing is an internet-based computing that provides shared processing of resources and data to computers and other devices based on demand. The cloud computing provides access to the resources like networks, servers, storage, applications and services. These resources can be rapidly provisioned and released with minimal management effort in the cloud computing.
[0038] The embodiments of the present invention enable a user to search for a person profile in a single easy-to-use user interface and return a list of profile information processed based on internet sources. The results are derived based on measurements of the hits found from internet sources and a semantic similarity value that changes dynamically from time to time. Semantic similarity is determined by the meaning of keywords extracted from search results and their related keywords; that is, the difference between two objects is calculated from the likeness of their meaning, as opposed to their syntactical or visual representation (e.g. the string format, the shape of the object, etc.). However, there are situations where the meanings of two objects are different, and yet they are similar in many aspects. For example, the similarities of “Anwar Ibrahim” and “Dr. Mahathir” include that they were in the same coalition party, both were great leaders in Malaysia, both were politicians and ex-UMNO members, etc.; but they were different in that Anwar had never been a Prime Minister and had been jailed, while Dr. Mahathir had been the Prime Minister twice and had never been jailed. With a named entity recognition (NER) process, the semantic similarity of the two named objects can be identified and the results refined to give a highly accurate result to the user.
[0039] It is therefore an objective to provide a system and method for forming a person profile based on information found over the open web. The system includes a data harvesting crawler for gathering data from the internet, a spiral keywords and pages rotation processor adapted for bypassing captcha or any anti-bot means during data harvesting, a semantic similarity module for identifying data relevancy and obtaining the most relevant data from the harvested data, and a named-entity recognition (NER) module for generating entities as relevant pre-defined structured data. In one embodiment, the measurement is based on internet sources, and the semantic similarity value will change dynamically over time. More specifically, semantic similarity is determined by the meaning of keywords extracted from search results and their related keywords.
[0040] FIG. 1 illustrates a block diagram of a person profile finder system 100 in accordance with an embodiment of the present invention. The person profile finder system 100 comprises a data harvesting bot 110, a spiral keyword processor 120, a semantic similarity processor 130 and a named-entity recognition (NER) processor 140 to process an input 101 and to output the processed results 102. The input 101 is the user’s input, which includes any of the name, photo, email, phone number, etc. of the object of interest. The processed results 102 are the compiled search results for the object of interest. The data harvesting bot 110 is a data crawler, web crawler or web data crawler for systematically browsing the world wide web (WWW) to gather data in relation to the input 101. The spiral keyword processor 120 provides a first round of processing adapted to bypass the captcha or firewall while the bot 110 is crawling the data. The semantic similarity processor 130 processes the crawled data to extract the data most relevant to the input 101. The NER processor 140 processes the data and generates entities from the relevant data.
[0041] FIG. 2 exemplifies a screenshot for providing inputs in accordance with an embodiment of the present invention. The screenshot 200 provides fields for users to input information regarding their object of interest. In this screenshot 200, there are name, email, phone and photo fields for users to fill in. In other embodiments, other fields may also be available for refining search results.
[0042] FIG. 3 illustrates a person profile search process in accordance with an embodiment of the present invention. The process 300 is carried out by the engine of the system 100 in FIG. 1. Briefly, the process 300 comprises the steps of receiving inputs from the user at step 305; harvesting data over the internet at step 310; spiral keywords and pages rotation at step 320; identifying data relevancy based on semantic similarity at step 330; and identifying keywords through NER processing at step 340.
[0043] Returning to the step 305, the system 100 acquires user input through a graphical user interface (GUI) comprising fields for user input. An example of the GUI is illustrated in FIG. 2. The user provides one or more items of input information in the fields through the GUI and, once done, triggers the person profile processing. At the step 310, the data harvesting bot 110 crawls the WWW to perform a search based on the supplied information. The search may be carried out through any search engine, such as Google, Yahoo, Bing and/or any other public or proprietary engines. As the search engine is performing the search, the spiral keywords and pages rotation method is carried out as and when necessary to bypass a captcha or firewall built on the source of the information found. Up to this point, the steps involve gathering the data relevant to the search input.
[0044] Referring now to the step 330, the system 100 performs the semantic similarity processing over the gathered data to extract only the relevant data in relation to the object of interest, i.e. to filter out irrelevant data. At the step 340, the system further performs NER processing through the NER processor 140 to generate entities from the highly relevant data. The entities are compiled and output in a prescribed format to the user at the step 345.
[0045] FIG. 4 illustrates a schematic diagram of the web crawling in accordance with an embodiment of the present invention. The web crawling can be executed by the data harvesting bot 110 in FIG. 1. As provided earlier, the data harvesting bot searches for the object of interest over the web. Over the WWW, the harvesting bot 110 searches through the web resources that contain the inputted information concerning the object of interest. The web resources include any of news, webpages, blogs, eBooks, videos, Instagram, images, LinkedIn, phone lookup, Facebook, Twitter, and many more. All the hits (results) are stored for further processing.
[0046] FIG. 5 illustrates a diagrammatic representation of a spiral keywords and pages rotation engine in accordance with an embodiment of the present invention. Briefly, the spiral keywords and page rotation method is performed to avoid anti-bot mechanisms by briefly hopping from one targeted server to another targeted server, then back again, and hopping again until all the required data is extracted. In one embodiment, the duration of each hop may change at random within a range of time set in a configuration file, for example 10-30 seconds. Such a random delay mechanism can effectively avoid most, if not all, of the anti-bot mechanisms or firewalls adapted to prevent data crawling. As shown in the diagram, it comprises a spiral centrally disposed on a pane of four segments parted by two perpendicularly crossed axes. In this embodiment, the four segments are Blog, social media, news and webpages respectively arranged in a clockwise manner. The spiral line indicates the sequence of hopping between the four segments, wherein each loop of the spiral indicates a cycle of the system hopping from Web > Blog > Social Media > News with the keywords.
[0047] As shown, a Keyword 1 is sent as a query to a website to crawl for data and, after a short stay on that website, the engine hops to a Blog page, then a social media page and then a news page, then back to a website, and so on. For each hop, the engine gathers some of the data found in relation to Keyword 1. Once Keyword 1 is completed, the system sends Keyword 2, and so on until all the keywords are exhausted.
[0048] FIG. 6 exemplifies a comparison of data captured and processed with a conventional method and with the spiral keyword processor 120. The table on the top exemplifies data that are processed through the conventional method, whereby any data crawler bot that stays too long in one category will, as a result, be blocked by anti-bot measures such as a captcha or firewall. Basically, the old data harvesting bot searches based on the list of pages/sites found and crawls the pages sequentially. Typically, the bot stays on one particular page or site until it harvests all the required information in relation to the keyword in question. In this table, “Malaysia” is the keyword, and it took 2 time-units to harvest the required data on Web-Google before the bot moved on to Blog-Blogspot with the same keyword “Malaysia”. Similarly, it took 2 time-units on Blog-Blogspot before it moved on to Social Media-Facebook. On Social Media-Facebook it also stayed for 2 time-units, and so on.
[0049] The time units spent on each page or site depend on the size of each page and the associated pages, which include linked pages found in the page in question. If there are many pages associated with the targeted page/site, the data harvesting bot 110 will stay on the targeted server hosting the targeted page for a considerable amount of time, which will trigger the anti-bot mechanism to block the data crawling. Usually, data crawling stops at that point.
[0050] The lower table in FIG. 6 exemplifies data obtained by the data harvesting bot 110 through the spiral keyword processor 120. The data harvesting bot 110 crawls the list of targeted pages/sites spirally, such that the data harvesting bot 110 stays on each page/site for only 1 time unit, then on another for 1 time unit, and so on. By hopping between web servers, the data harvesting bot 110 appears to each server to have some delay in its activity, which is how a real human behaves.
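The visiting order described in [0046] to [0050] can be illustrated with a small offline simulation. This is only a sketch under stated assumptions: the names `spiral_schedule`, `hop_delay` and `work_units` are hypothetical and do not come from the specification, no network requests are made, and the 10-30 second range is the example value mentioned in [0046].

```python
import random
from collections import deque


def spiral_schedule(keywords, sites, work_units):
    """Yield (keyword, site) visits, one time unit each, in round-robin order.

    work_units[site] is the number of time units needed to finish one
    keyword on that site (an invented figure for this simulation).
    """
    for keyword in keywords:
        pending = deque((site, work_units[site]) for site in sites)
        while pending:
            site, remaining = pending.popleft()
            yield keyword, site
            if remaining > 1:
                # Unfinished site: rejoin the back of the queue for the next loop.
                pending.append((site, remaining - 1))


def hop_delay(rng, low=10.0, high=30.0):
    """Pick a random pause in seconds from the configured range."""
    return rng.uniform(low, high)


sites = ["Web", "Blog", "Social Media", "News"]
visits = list(spiral_schedule(["Malaysia"], sites, {s: 2 for s in sites}))
for keyword, site in visits:
    print(keyword, "->", site)
```

With four sites that each need 2 time units, every site is visited twice and three other visits separate the two visits to the same site, which matches the 3 time-unit gap per server described for FIG. 6.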
[0051] In the table, the data harvesting bot 110 sends “Malaysia” as the keyword to Web-Google and takes 1 time-unit to harvest any data harvestable within that time unit, then hops to Blog-Blogspot with the same keyword to harvest whatever it can in another time unit, then hops to Social Media-Facebook for another time unit and then to News-Bing for a further 1 time unit. When one cycle is complete, the data harvesting bot 110 returns to the same Web-Google page to continue where it left off and, once the 1 time-unit expires, it hops to Blog-Blogspot, and so on. As far as each particular server is concerned, there is a delay of at least 3 time units between visits to each site, so that the server does not treat the visitor as a bot, thus allowing the bot to harvest as much data as possible.
[0052] The spiral keyword and page engine of the data harvesting bot 110 is disclosed in more detail in Malaysia Patent Application no. PI 2019005731 entitled “A system and method to prevent bot detection”, which is incorporated herein by reference.
[0053] FIG. 7 illustrates a matrix of objects for semantic similarity comparison in accordance with an embodiment of the present invention. All identified hits in relation to the searched object undergo a semantic similarity comparison through the matrix of objects shown. The matrix comprises the scores of the comparisons of the object pairs. The present matrix is adapted only for text information of the objects found. For non-text information, such as photos/images, multimedia resources, etc., a separate matrix may be required. Every matrix can be regarded as a system of intersections of ranges of variables to give a clear interpretation of the spatial relations in the matrix. In one embodiment, text information is compared against non-text information to render scores.
[0054] The following equation provides:
[Equation image imgf000016_0001; not reproduced in this text]
[0055] FIG. 8 shows a process for obtaining semantic information for object in accordance with an embodiment of the present invention. The semantic information comprise the keywords obtained from the user inputs, and related keywords generated based on those inputted keywords.
[0056] FIG. 9 illustrates a scoring process of semantic similarity value to establish data relevancy in accordance with an embodiment of the present invention. The process aims to compare two objects in the matrix and dynamically calculates semantic similarity of the object pairs. The dynamic calculation is required for processing web resources that change in time. The process comprises determining semantic information of the object pair in the matrix from the internet at step 902, calculating overlap semantic information at step 904, and deriving semantic similarity score/value of the object pair for the matrix at step 906. The score/value is calculated and updated dynamically as the information is updated. [0057] The above process is repeated until all the object pairs’ values in the matrix are computed.
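The loop over all object pairs in steps 902 to 906 can be sketched as follows. This is a minimal illustration, not the specification's implementation: `similarity_matrix` and `jaccard` are hypothetical names, the keyword sets are invented, and the Jaccard ratio is only a placeholder scoring function that can be swapped for any other pairwise score.

```python
def jaccard(a, b):
    """Placeholder score (an assumption): shared keywords over all keywords."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(semantic_info, score=jaccard):
    """Fill a symmetric matrix with one score per object pair."""
    names = list(semantic_info)
    matrix = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i:]:
            value = score(semantic_info[a], semantic_info[b])
            # The comparison is symmetric, so store it in both cells.
            matrix[a][b] = value
            matrix[b][a] = value
    return matrix


# Invented semantic information (keyword sets) for three objects.
info = {
    "A": {"politician", "malaysia", "leader"},
    "B": {"politician", "malaysia", "minister"},
    "C": {"football", "coach"},
}
matrix = similarity_matrix(info)
print(matrix["A"]["B"])
```

Because the source data changes over time, one simple way to keep the values current is to call `similarity_matrix` again whenever the keyword sets are refreshed.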
[0058] FIG. 10 illustrates a schematic diagram of the keyword semantic similarity determination 902 in accordance with one embodiment of the present invention. Each keyword is processed by a language detector. The language detector references the keyword against a multilingual related keywords corpus. The multilingual related keywords corpus comprises various internal/external databases, including online encyclopedias such as Wikipedia, lexical databases such as Wordnet, dictionaries, thesauri, corpora, and other resources. Through the detection, a list of related keywords is generated.
[0059] FIG. 11 illustrates a process for detecting the language of a keyword in accordance with an embodiment of the present invention. The process comprises inserting keyword terms at step 1102, tokenizing the keyword terms at step 1104, comparing each token against a language service 1110 at step 1106, and tagging each keyword with its language at step 1108.
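Steps 1102 to 1108 can be sketched as below. The specification does not describe how the language service 1110 works internally, so this example assumes a tiny hypothetical word list per language; `LANGUAGE_SERVICE`, `tokenize`, `detect_language` and `tag_keywords` are illustrative names only.

```python
import re

# Hypothetical stand-in for the language service 1110: a tiny word list per
# language. A real service would be far larger or statistical.
LANGUAGE_SERVICE = {
    "en": {"prime", "minister", "leader", "party"},
    "ms": {"perdana", "menteri", "pemimpin", "parti"},
}


def tokenize(keyword_term):
    """Step 1104: split a keyword term into lowercase word tokens."""
    return re.findall(r"\w+", keyword_term.lower())


def detect_language(token, service=LANGUAGE_SERVICE):
    """Step 1106: return the first language whose word list has the token."""
    for language, words in service.items():
        if token in words:
            return language
    return "unknown"


def tag_keywords(keyword_term):
    """Steps 1102-1108: tokenize the term and tag each token with a language."""
    return [(token, detect_language(token)) for token in tokenize(keyword_term)]


print(tag_keywords("Perdana Menteri, Prime Minister"))
```

Tokens that match no word list are tagged `unknown` rather than guessed, which leaves the decision about a fallback language to the caller.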
[0060] FIG. 12 illustrates a process of generating related keywords in accordance with an embodiment of the present invention. The process comprises inserting a tokenized keyword that is tagged with its corresponding language type at step 1202, querying Wikipedia based on the detected language for each tokenized keyword at step 1204, querying a machine readable dictionary (MRD) and thesaurus based on the tokenized keyword at step 1206, and querying Wordnet based on the tokenized keyword at step 1208.
[0061] At step 1204, all hypertext linked words in the returned page are grabbed and stored in a keyword repository 1220, and similarly the synonym words retrieved from the MRD, thesaurus and Wordnet are stored in the keyword repository 1220.
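The collection of related keywords into the repository 1220 can be sketched as follows. The three lookup tables are hypothetical in-memory stand-ins for the online encyclopedia, the MRD/thesaurus and the lexical database; their contents are invented for illustration, and no real Wikipedia or Wordnet API is called.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the external sources.
ENCYCLOPEDIA_LINKS = {  # (language, keyword) -> hyperlinked words on the page
    ("en", "minister"): ["government", "cabinet"],
}
DICTIONARY_SYNONYMS = {"minister": ["official"]}
LEXICAL_DB_SYNONYMS = {"minister": ["government minister", "official"]}


def related_keywords(tagged_keywords):
    """Collect related keywords per token into one repository (a dict of sets)."""
    repository = defaultdict(set)
    for token, language in tagged_keywords:
        # Encyclopedia lookup depends on the detected language.
        repository[token].update(ENCYCLOPEDIA_LINKS.get((language, token), []))
        # Dictionary/thesaurus and lexical database lookups use the token only.
        repository[token].update(DICTIONARY_SYNONYMS.get(token, []))
        repository[token].update(LEXICAL_DB_SYNONYMS.get(token, []))
    return dict(repository)


repo = related_keywords([("minister", "en"), ("menteri", "ms")])
print(sorted(repo["minister"]))
```

Storing each token's related words as a set removes duplicates when the same synonym is returned by more than one source.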
[0062] FIG. 13 illustrates a calculation of overlapped semantic information in accordance with an embodiment of the present invention. In this embodiment, the calculation uses a Lesk Algorithm to identify the semantic similarity of Object A and Object B based on the equation below:
[0063] $$\text{Semantic Similarity} = \frac{\mathrm{SP1} + \mathrm{SP2}}{2}$$
[0064] where,
[0065] SP1 counts the overlaps of Object A and Object B;
[0066] SP2 is an average of fraction of overlapped keyword of Object A and Object B.
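The formula can be read in more than one way, so the following is only one literal interpretation, stated as an assumption: SP1 is taken as the raw count of keywords shared by Object A and Object B, and SP2 as the mean of the two fractions of each object's keywords that are shared. The specification does not say whether SP1 is normalized, so with this reading the result can exceed 1. The function name `semantic_similarity` is hypothetical.

```python
def semantic_similarity(keywords_a, keywords_b):
    """Return (SP1 + SP2) / 2 under the literal reading described above."""
    a, b = set(keywords_a), set(keywords_b)
    overlap = a & b
    # SP1 (assumed): raw count of overlapping keywords.
    sp1 = len(overlap)
    # SP2 (assumed): average of the overlapped fraction of A and of B.
    fraction_a = len(overlap) / len(a) if a else 0.0
    fraction_b = len(overlap) / len(b) if b else 0.0
    sp2 = (fraction_a + fraction_b) / 2
    return (sp1 + sp2) / 2


object_a = {"a", "b", "c", "d"}
object_b = {"c", "d"}
print(semantic_similarity(object_a, object_b))
```

In this example the overlap is 2, the fractions are 0.5 and 1.0, SP2 is 0.75, and the result is (2 + 0.75) / 2 = 1.375. If a score bounded by 1 is needed, SP1 would have to be normalized, which the source text does not specify.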
[0067] FIG. 14 exemplifies search results obtained by the present system and method in accordance with the above embodiments of the present invention. The top half of the figure is an article found by the data crawler, whereby keywords of the article are categorized as date, location, organization and person. The bottom half of the figure is a table listing various categorized information about a subject of interest, i.e. “Anwar Ibrahim”, found on the page above.
[0068] While specific embodiments have been described and illustrated, it is understood that many changes, modifications, variations and combinations thereof could be made to the present invention without departing from the scope of the invention.

Claims
1. A method for profiling an object based on search input, the method comprising: receiving (305) the search input of the object to be profiled, wherein the input includes keywords; harvesting (310) data from internet through a data harvesting bot (110) by crawling through pages/sites of the internet; rotating (320) the keywords and the pages/sites through a spiral keyword processor (120) for hopping between targeted pages/sites for avoiding anti-bot mechanisms on the targeted pages/sites; identifying (330) data relevancy based on semantic similarity of the keywords to get most relevant data from the harvested data; identifying (340) the keywords through named entity recognition, NER, processor (140) to extract most relevant data; and outputting (345) the profile of the object in a structured manner with highly relevant data.
2. The method according to claim 1, wherein the identifying (330) data relevancy further comprising: determining (902) semantic information of an object pair in a matrix from the internet; calculating (904) overlap semantic information; and outputting (906) semantic similarity value on the matrix.
3. The method according to claim 2, further comprising: tokenizing (1104) keywords; comparing (1106) each token against language detect service (1110); and tagging (1108) each keyword with its language.
4. The method according to claim 2, further comprising: querying (1202) online encyclopedia based on the detected language for each tokenized keyword; querying (1204) machine readable dictionary, MRD, and thesaurus based on the tokenized keyword; querying (1206) lexical database based on the tokenized keyword; and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from MRD, thesaurus and lexical database on a keyword repository (1220).
5. The method according to claim 1, wherein the semantic similarity is derived based on Lesk Algorithm.
6. An object profiling system for profiling an object based on search input, said system comprising: a graphical user interface, GUI, for receiving input of the object to be profiled, wherein the input includes keywords; a data harvesting bot (110) for harvesting data from internet by crawling through pages/sites of the internet; a spiral keyword processor (120), operationally hopping between targeted pages/sites for avoiding anti-bot mechanism on the targeted pages/sites; a semantic similarity processor (130) for establishing data relevancy to obtain the most relevant data from the harvested data; a named-entity recognition, NER, processor (140) for classifying the keywords to extract most relevant data; and an output of the profile of the object in a structured manner with the most relevant data.
7. The object profiling system according to claim 6, wherein the semantic similarity processor (130) is adapted to operationally determine semantic information of an object pair in a matrix from the internet, calculating overlap semantic information and outputting semantic similarity value on the matrix.
8. The object profiling system according to claim 7, wherein the semantic similarity processor (130) is adapted for tokenizing keywords, comparing each token against language detect service, and tagging each keyword with its detected language.
9. The object profiling system according to claim 7, wherein the semantic similarity processor (130) is adapted for querying the online encyclopedia on detected language for each tokenized keyword, querying machine readable dictionary, MRD, thesaurus and lexical database based on tokenized keyword, and storing all hypertext linked words extracted from the returned page and the synonym words retrieved from MRD, thesaurus and lexical database on a keyword repository (1220).
10. The object profiling system according to claim 7, wherein the semantic similarity is derived based on a Lesk Algorithm.
PCT/MY2020/050167 2020-07-30 2020-11-24 Person profile finder using semantic similarity measurement of object based on internet source and related keywords WO2022025750A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2020003948 2020-07-30
MYPI2020003948 2020-07-30

Publications (1)

Publication Number Publication Date
WO2022025750A1 true WO2022025750A1 (en) 2022-02-03

Family

ID=80035974

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2020/050167 WO2022025750A1 (en) 2020-07-30 2020-11-24 Person profile finder using semantic similarity measurement of object based on internet source and related keywords

Country Status (1)

Country Link
WO (1) WO2022025750A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023191317A1 (en) * 2022-04-01 2023-10-05 주식회사 솔트룩스 Method, device, and computer-readable recording medium for monitoring risk or opportunity event related to user-customized topic through deep signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110115542A (en) * 2010-04-15 2011-10-21 팔로 알토 리서치 센터 인코포레이티드 Method for calculating semantic similarities between messages and conversations based on enhanced entity extraction
US9436760B1 (en) * 2016-02-05 2016-09-06 Quid, Inc. Measuring accuracy of semantic graphs with exogenous datasets
US20160323398A1 (en) * 2015-04-28 2016-11-03 Microsoft Technology Licensing, Llc Contextual people recommendations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KIM YONG-YOUNG, YONG-KI KIM, DAE-SIK KIM, MI-HYE KIM: "Issue Analysis on Gas Safety Based on a Distributed Web Crawler Using Amazon Web Services", vol. 16, no. 12, 31 December 2018 (2018-12-31), pages 317 - 325, XP055890776, ISSN: 2713-6442, DOI: 10.14400/JDC.2018.16.12.317 *
TORRES SULEMA, GELBUKH ALEXANDER: "Comparing Similarity Measures for Original WSD Lesk Algorithm", 31 January 2009 (2009-01-31), pages 155 - 166, XP055890774, Retrieved from the Internet <URL:http://nlp.cic.ipn.mx/Publications/2009/Comparing%20Similarity%20Measures%20for%20Original%20WSD%20Lesk%20Algorithm.pdf> [retrieved on 20220211] *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20946706

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20946706

Country of ref document: EP

Kind code of ref document: A1