CN112560483A

CN112560483A - Automatic detection of personal information in free text

Info

Publication number: CN112560483A
Application number: CN202011013395.0A
Authority: CN
Inventors: A·芬克尔施泰因; B·哈伊姆; E·梅纳赫姆
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2019-09-25
Filing date: 2020-09-24
Publication date: 2021-03-26
Also published as: US11429790B2; US20210089620A1

Abstract

The invention relates to automatic detection of personal information in free text, comprising: automatically applying a named-entity recognition (NER) algorithm to a digital text document to detect named entities appearing in the digital text document, wherein the named entities are selected from the group consisting of: at least one person-type entity, and at least one non-person-type entity; automatically detecting at least one relationship between the named entities by applying a part-of-speech (POS) tagging algorithm and a dependency parsing algorithm to sentences of the digital text document that contain the detected named entities; automatically estimating whether the at least one relationship between the named entities indicates personal information; and automatically issuing a notification of the result of the estimation.

Description

Automatic detection of personal information in free text

Technical Field

The present invention relates to the field of automatic text analysis.

Background

Recent global increases in information privacy regulations have resulted in various techniques for assessing whether digitally stored information complies with such regulations. In addition, the growth of security attacks on sensitive data storage has also fueled the development of these technologies, so organizations can allocate resources to protect high-risk databases and storage systems.

Such techniques provide a risk assessment tool with respect to compliance with GDPR, PCI, HIPAA, CCPA, LGPD, and other regulations by using sophisticated data classification techniques, vulnerability scanning, and risk scoring.

One such tool is IBM Corporation's Security Guardium Analyzer, which is intended to help identify regulated data risks by analyzing on-premises and cloud databases to find prioritized risk information and provide it to users. It includes a classification engine that searches data in database tables, performs vulnerability scanning, and discovers current threats.

The foregoing examples of related art and limitations related thereto are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the drawings.

Disclosure of Invention

The following embodiments and aspects thereof are described and illustrated in conjunction with systems, tools, and methods, which are meant to be exemplary and illustrative, not limiting in scope.

One embodiment relates to a method comprising operating at least one hardware processor to: automatically apply a named-entity recognition (NER) algorithm to a digital text document to detect named entities appearing in the digital text document, wherein the named entities are selected from the group consisting of: at least one person-type entity, and at least one non-person-type entity; automatically detect at least one relationship between the named entities by applying a part-of-speech (POS) tagging algorithm and a dependency parsing algorithm to sentences of the digital text document that contain the detected named entities; automatically estimate whether the at least one relationship between the named entities indicates personal information; and automatically issue a notification of the result of the estimation.

Another embodiment relates to a system comprising: (a) at least one hardware processor; and (b) a computer-readable storage medium having program code embodied thereon, the program code executable by the at least one hardware processor to: automatically apply a named-entity recognition (NER) algorithm to a digital text document to detect named entities appearing in the digital text document, wherein the named entities are selected from the group consisting of: at least one person-type entity, and at least one non-person-type entity; automatically detect at least one relationship between the named entities by applying a part-of-speech (POS) tagging algorithm and a dependency parsing algorithm to sentences of the digital text document that contain the detected named entities; automatically estimate whether the at least one relationship between the named entities indicates personal information; and automatically issue a notification of the result of the estimation.

Another embodiment relates to a computer program product comprising a computer-readable storage medium having program code embodied therewith, the program code executable by at least one hardware processor to: automatically apply a named-entity recognition (NER) algorithm to a digital text document to detect named entities appearing in the digital text document, wherein the named entities are selected from the group consisting of: at least one person-type entity, and at least one non-person-type entity; automatically detect at least one relationship between the named entities by applying a part-of-speech (POS) tagging algorithm and a dependency parsing algorithm to sentences of the digital text document that contain the detected named entities; automatically estimate whether the at least one relationship between the named entities indicates personal information; and automatically issue a notification of the result of the estimation.

In some embodiments, the method further comprises, or the program code is further executable to: replace, in the digital text document, pronouns associated with the at least one person-type entity with nouns that are names of the at least one person-type entity.

In some embodiments, the method further comprises, or the program code is further executable to: prior to automatically applying the NER algorithm, automatically pre-process the digital text document by at least one of: detecting a predominant language of the digital text document, and selecting the NER algorithm to match the predominant language; deleting from the digital text document at least one of: whitespace, and technical characters; and correcting spelling errors in the digital text document.

In some embodiments, the at least one non-person type entity is selected from the group consisting of: organization, object, location, nationality, time, date, address, artwork, event, marital status, occupation, money, language, and quantity.

In some embodiments, the method further comprises, or the program code is further executable to: automatically apply a different named-entity recognition (NER) algorithm to the digital text document; and apply one or more predefined rules to resolve one or more conflicts between the named entities detected by the NER algorithm and those detected by the different NER algorithm.

In some embodiments, the method further comprises, or the program code is further executable to: filter the named entities and merge at least some of the named entities.

In some embodiments, said automatic detection of at least one relationship between the named entities further comprises: determining, using the results of the applied dependency parsing algorithm, a dependency path connecting every two named entities in each sentence; selecting a textual expression located within the dependency path; and associating each textual expression with a relationship type selected from a predefined set of relationship types.

In some embodiments, the automatically estimating comprises calculating a privacy score for the digital text document, or for each of the at least one person-type entities, based on: a first set of predefined scores associated with the relationship types, wherein each score of the first set represents a likelihood that the respective relationship type is part of the personal information; and a second set of predefined scores associated with the named entities, wherein each score of the second set indicates a likelihood that the respective named entity is part of the personal information.

In some embodiments, the method further comprises, or the program code is further executable to: automatically detect that the at least one person-type entity includes at least a portion of a person's name; automatically apply the NER algorithm to a training set comprising a plurality of other digital text documents that contain full names, to detect a plurality of person-type entities and a plurality of non-person-type entities; automatically detect relationships between the plurality of person-type entities and the plurality of non-person-type entities by applying a part-of-speech (POS) tagging algorithm and a dependency parsing algorithm to sentences of the plurality of other digital text documents, each sentence containing at least two named entities of the plurality of person-type entities and the plurality of non-person-type entities; automatically generate a training knowledge graph whose nodes include those of the plurality of person-type entities and the plurality of non-person-type entities that are related to each other, and whose edges include respective ones of the relationships; automatically generate a specific knowledge graph whose nodes include the at least one person-type entity and the at least one non-person-type entity that are related to each other, and whose edges include respective ones of the at least one relationship; and automatically determine at least one full name of the at least one person-type entity by cross-referencing the specific knowledge graph with the training knowledge graph.

In some embodiments, the cross-referencing is based on at least one of: a graph matching technique, and a Boolean satisfiability problem (SAT) representation and solving technique.

In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the drawings and by study of the following detailed description.

Drawings

Exemplary embodiments are shown in the drawings. The dimensions of the features and characteristics shown in the figures are generally chosen for convenience and clarity of presentation and are not necessarily shown to scale. These figures are listed below.

FIG. 1 is a block diagram of a system for automatically detecting personal information in free text, according to one embodiment;

FIG. 2 is a flow diagram of a method for automatically detecting personal information in free text, according to one embodiment;

FIG. 3 is a diagram of exemplary NER, POS tagging, and dependency parsing results, according to one embodiment;

FIG. 4 is a flow diagram of a method for inferring the full name of a person-type entity that is mentioned only implicitly in a digital text document, according to one embodiment.

Detailed Description

Automatic detection of personal information in free text is disclosed herein. The detection utilizes a specific configuration of Natural Language Processing (NLP) techniques and optionally graph theory techniques to detect the presence of personal information about real persons in text.

First, a named-entity recognition (NER) algorithm may be applied to a digital text document suspected of containing personal information, to detect named entities appearing in the document. These named entities may include person-type entities (i.e., partial or full names of real people) as well as other types of entities such as organizations, locations, or nationalities, to name a few. Other examples are given further below.

Next, relationships between the named entities are detected, such as a relationship between a pair of person-type entities or a relationship between a person-type entity and a non-person-type entity. This may be performed by applying a part-of-speech (POS) tagging algorithm and a dependency parsing algorithm to those sentences of the document that contain previously detected named entities. This step generates POS tags for the words in the sentences and syntactic dependencies between the words. Based on these results, a dependency path connecting every two named entities in each sentence can be determined, and a textual expression located within the path is selected. The expression serves as a descriptor of the type of relationship between the two entities.

Then, an estimation can be performed as to whether any relationship type (person-to-person or person-to-non-person) between the entities indicates personal information associated with the respective person-type entity. For example, when a person-to-person relationship is detected, the relationship itself may be regarded as personal information of one or both of these persons, or the relationship plus the name of person A may be regarded as personal information of person B (e.g., a "married" relationship between "Andrey" and "Orli" may be regarded as personal information of each of Andrey and Orli, although even the fact that each of them is married, without specifying the spouse, may be regarded as personal information of each person). Optionally, a score is calculated for the entire document and/or for each person-type entity present in it, to quantify the probability that the document does contain personal information, the amount of personal information detected, and/or the severity of the personal information.

For documents that reference a person-type entity only implicitly, such as by referring to the corresponding person only by a first name, last name, nickname, initials, etc., a technique is disclosed to infer the full name of the person-type entity. The technique includes knowledge graph-based training on other documents that contain full person names, and then comparing the training output with a knowledge graph built for the document that contains the implicit person-type reference.
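The idea of comparing a document's knowledge graph with a training knowledge graph can be illustrated with a deliberately simple sketch. It is not the disclosed graph matching or SAT technique: it only counts shared edges, and the function name `infer_full_name`, the dictionary layout, and all names and relationship labels are hypothetical.

```python
# Minimal sketch: each person maps to a set of (relationship type, entity) edges.
# Plain edge overlap is used here as a stand-in for graph matching.

def infer_full_name(partial_name, observed_edges, training_graph):
    """Return the full name whose training edges overlap most with the observed edges."""
    best_name, best_overlap = None, 0
    for full_name, known_edges in training_graph.items():
        # Only consider full names that contain the partial mention as a word.
        if partial_name.lower() not in full_name.lower().split():
            continue
        overlap = len(observed_edges & known_edges)
        if overlap > best_overlap:
            best_name, best_overlap = full_name, overlap
    return best_name


# Hypothetical training graph built from documents that contain full names.
training_graph = {
    "Dana Levi": {("employed by", "Acme"), ("resides in", "Haifa")},
    "Dana Cohen": {("employed by", "Globex")},
}
# Hypothetical edges found in a document that only mentions "Dana".
candidate = infer_full_name("Dana", {("employed by", "Acme")}, training_graph)
```

When no candidate shares any edge with the document, the sketch returns `None` rather than guessing.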

As used herein, the term "personal information" (also referred to as "personal data," "personally identifiable information" (PII), or "sensitive personal information" (SPI)) may refer to any information relating to a real person, including (a) information that may be used to distinguish or trace the identity of the person, such as a name, identification number, social security number, driver's license number, date and place of birth, mother's maiden name, residential address, Internet Protocol (IP) address, email address, telephone number, or biometric record, and (b) information that is linked or linkable to the person, such as the person's medical, educational, financial, and employment information. The terms "personal information", "personal data", PII, and SPI are legal concepts rather than technical ones, and their use and meaning may vary by jurisdiction and regulation. Accordingly, embodiments of the present invention may be readily adapted to detect various types of such information in free text, as desired by the user. For example, if the user defines the "personal information" sought as the food a person eats, the NER algorithm can be directed to also look for food-type entities, and the detection of relationships between named entities can focus on textual expressions that include stems such as "eat", "digest", or "enjoy", which indicate that the person has eaten a certain food.

Referring now to FIG. 1, shown is a block diagram of an exemplary system 100 for detecting personal information in free text, according to one embodiment. The system 100 may include one or more hardware processors 102, random-access memory (RAM) 104, and one or more non-transitory computer-readable storage devices 106.

The storage device 106 may have stored thereon program instructions and/or components configured to operate the hardware processor 102. The program instructions may include one or more software modules, such as personal information detection module 108. The software components may include an operating system with various software components and/or drivers for controlling and managing general system tasks (e.g., memory management, storage device control, power management, etc.) and facilitating communication between the various hardware and software components.

The system 100 may operate by loading the instructions of the personal information detection module 108 into the RAM 104 as they are being executed by the processor 102. The instructions of the personal information detection module 108 may cause the system 100 to receive free text 110, process it, and output an estimate of the personal information 112 contained in the text.

The system 100 as described herein is merely an exemplary embodiment of the present invention and may be implemented in practice in hardware only, in software only or in a combination of both hardware and software. The system 100 may have more or fewer components and modules than shown, may combine two or more components, or may have a different configuration or arrangement of components. System 100 may include any additional components that enable it to function as an operational computer system, such as a motherboard, data bus, power supply, network interface card, etc. (not shown). The components of system 100 may be co-located or distributed (e.g., in a distributed computing architecture).

The instructions of the personal information detection module 108 are now discussed with reference to the flowchart of FIG. 2, which illustrates a method 200 for detecting personal information in free text, according to one embodiment.

The steps of method 200 may be performed in the order in which they appear, or in a different order (or even in parallel), as long as the order allows the necessary input to a step to be obtained from the output of an earlier step. Additionally, the steps of method 200 are performed automatically (e.g., by system 100 of FIG. 1), unless specifically noted otherwise.

First, free text may be received, for example, in the form of a digital text document (hereinafter "document") 202. As known in the art, the term "free text" (also referred to as "free-form text") refers to text written primarily in the form of words and sentences, as opposed to "structured" text, which is typically simply a word or multi-word expression that is stored separately in a database table or the like.

The document 202 may also include graphics (e.g., images, diagrams, illustrations, etc.), which may optionally be ignored in subsequent steps of the method 200. Alternatively, optical character recognition (OCR), as known in the art, may be applied to the graphics, and the recognized text may be considered part of the document 202 for purposes of the subsequent steps of the method 200.

Receipt of the document 202 may be manual, such as by a user loading the document into the system 100 (FIG. 1), or automatic, such as by the system periodically (or by some other predefined trigger) retrieving the document from memory or network storage resources.

In an optional preprocessing step 204, the text in the document 202 may be analyzed and refined so that the NLP techniques applied later produce more accurate results. The preprocessing step 204 may include one or more of the following actions:

First, the language in which the document 202 is written is detected. If most of the text in the document 202 is in a first language, but some of the text (e.g., less than 5% of the total amount of text) is in one or more other languages, then it may be determined, for purposes of the method 200, that the predominant language of the document is the first language. Since most NER algorithms available today are language-specific, language detection makes it possible to select an NER algorithm (to be applied in step 206) that is designed for the relevant language.

Second, unnecessary elements are removed, such as redundant whitespace (e.g., a space wider than one unit of inter-character spacing) and/or technical characters that are not part of the text but rather reflect how the text is saved in the digital file, e.g., characters representing line feeds (e.g., "\n"), carriage returns (e.g., "\r"), tab characters (e.g., "\t"), etc., as is known in the art.

Third, spelling and typographical errors in the text are corrected, for example, using an automatic spell checker.

Fourth, each pronoun associated with a person-type entity is replaced with a noun that is the name of the person-type entity. For example, consider the following text, in which the pronoun "he" is used:

"John went to Jessica's home yesterday. He brought his famous homemade cookies."

In order for subsequent NLP techniques to correctly analyze this text, "John" may be substituted for "he":

"John went to Jessica's home yesterday. John brought his famous homemade cookies."

Pronoun-noun substitution may be facilitated by applying NER and POS tagging to the text, to detect the person-type entities and the grammatical structure of each sentence in the text. In the above example, "he" is replaced with "John" because both are the subjects of their respective sentences, and it can safely be assumed that the author of the text meant "John" when writing "he". Similar rules for substituting pronouns with nouns will become apparent to those skilled in the art.

Because pronoun-noun substitution may require the use of NER and POS tagging, it may optionally be performed not as a separate step, but rather after NER and POS tagging are applied to the document 202 in steps 206 and 208, described below.
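Two of the preprocessing actions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names are hypothetical, the set of person names is assumed to come from an earlier NER pass, and the substitution rule covers only the simple case of a sentence-initial "he" or "she" following a sentence whose first word is a known person name.

```python
import re

# Hypothetical pronoun set for the subject-position rule described above.
SUBJECT_PRONOUNS = {"he", "she"}


def clean_text(raw):
    """Replace line feeds, carriage returns and tabs with spaces, then collapse repeated spaces."""
    no_technical = re.sub(r"[\r\n\t]", " ", raw)
    return re.sub(r" {2,}", " ", no_technical).strip()


def replace_subject_pronouns(sentences, person_names):
    """Replace a sentence-initial pronoun with the most recent sentence-initial person name."""
    last_subject = None
    result = []
    for sentence in sentences:
        words = sentence.split()
        if words:
            first = words[0]
            if first in person_names:
                last_subject = first
            elif first.lower() in SUBJECT_PRONOUNS and last_subject is not None:
                words[0] = last_subject
        result.append(" ".join(words))
    return result


cleaned = clean_text("John went\tto Jessica's  home yesterday.\r\nHe brought his famous homemade cookies.")
substituted = replace_subject_pronouns(
    ["John went to Jessica's home yesterday.", "He brought his famous homemade cookies."],
    {"John"},
)
```

A real implementation would rely on the POS tags and dependency structure rather than on word position alone, as the text notes.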

Next, in step 206, an NER algorithm (NER is also referred to as "entity identification," "entity chunking," or "entity extraction") may be applied to the document 202 to detect named entities (also referred to herein simply as "entities") appearing in it. As known in the art, an NER algorithm may locate named entities mentioned in text and classify them into predetermined categories, such as person, organization, object, location, nationality, time, date, address, artwork, event, marital status, occupation, money, language, quantity, and the like. Many NER algorithms today are capable of providing named entity classes with very fine granularity; e.g., they can determine not only that a certain term is an "object" but also that the object is a "vehicle", "food", or "appliance". To simplify the discussion, all types of entities that are not a "person" are referred to herein as "non-person-type entities". Each of these non-person-type entities may in some cases constitute personal information. For example, "<person> works at <organization>", "<person> owns <object>", "<person> has <quantity> children", "<person> supports <religious group>", and "I went to visit <person> in <location>" may all be considered personal information associated with the person.

The NER algorithm may be one that uses linguistic grammar-based techniques and/or statistical models (e.g., machine learning).

Prominent NER algorithms that may be used in step 206 include, for example, those in the following software packages: Natural Language Understanding by IBM Corporation; spaCy by Explosion AI GmbH, Germany; and the Natural Language Toolkit (NLTK), available online at www.nltk.org and described in Steven Bird, Edward Loper, and Ewan Klein (2009), "Natural Language Processing with Python", O'Reilly Media, Inc.

Optionally, multiple different NER algorithms are applied to the document 202 to enhance the detection of named entities. If the different NER algorithms disagree on the classification of a certain named entity, a set of predefined rules may be used to resolve the disagreement. For example, if one NER algorithm classifies "Wendy's" as an organization and another classifies it as a person, the predefined rules may dictate that a certain type of NER algorithm (e.g., a machine learning-based algorithm rather than a grammar-based one) takes precedence in person/organization conflicts. Similar rules may be defined for various types of conflicts, and such rules may take into account the entity types involved and the types of the NER algorithms. Another form of disagreement between different NER algorithms may relate to the length of the detected named entity, for example "Neymar da Silva Santos Júnior" versus "Neymar Jr.". Here, too, a rule may be defined that selects the name produced by a certain type of NER algorithm over another, or that selects, for example, the longer name.
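The rule-based resolution described above might be sketched as follows. The `Detection` record, the `PRECEDENCE` table, and the algorithm-type labels "ml" and "grammar" are assumptions made for illustration; the source does not prescribe a rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    name: str          # surface text of the named entity
    entity_class: str  # e.g. "person" or "organization"
    source: str        # hypothetical algorithm type: "ml" or "grammar"


# Hypothetical rule table: which algorithm type wins a given class conflict.
PRECEDENCE = {frozenset({"person", "organization"}): "ml"}


def resolve(a, b):
    """Pick one of two conflicting detections of the same mention using predefined rules."""
    if a.entity_class != b.entity_class:
        preferred = PRECEDENCE.get(frozenset({a.entity_class, b.entity_class}))
        for detection in (a, b):
            if detection.source == preferred:
                return detection
        return a  # no rule applies: keep the first detection
    # Same class but different span: prefer the longer name.
    return a if len(a.name) >= len(b.name) else b


by_class = resolve(
    Detection("Wendy's", "person", "grammar"),
    Detection("Wendy's", "organization", "ml"),
)
by_length = resolve(
    Detection("Neymar Jr.", "person", "ml"),
    Detection("Neymar da Silva Santos Júnior", "person", "grammar"),
)
```

Keying the table on an unordered pair of classes keeps one rule per conflict type regardless of which algorithm reported which class.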

Further optionally, the detected named entities are filtered to remove detections that do not meet certain predefined criteria. For example, named entities of fewer than X characters (e.g., three characters) may be removed, as may named entities that include certain non-alphabetic characters (e.g., "R2-D2" contains digits; hyphens, however, may be allowed because they are common in certain names of people). Another example is the use of whitelists and/or blacklists of names, which consist of available lists of names given to real persons and of fictional names, respectively. Such filtering may be required if the NER algorithm used relies solely or primarily on textual structure and is therefore not configured to distinguish real persons from fictional names (e.g., names of movie characters, mythological figures, etc.). However, many existing NER algorithms are able to make this distinction quite reliably.
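A minimal sketch of such a filter is shown below. The function name `keep_entity`, the set of allowed punctuation, and the blacklist contents are illustrative assumptions; only the three-character threshold and the "R2-D2" case come from the text.

```python
# Characters other than letters that are tolerated in a person name (an assumption).
ALLOWED_PUNCTUATION = " -'."


def keep_entity(name, min_length=3, blacklist=frozenset()):
    """Return True if a detected person name passes the predefined criteria."""
    if len(name) < min_length:
        return False  # too short to be a reliable detection
    if name in blacklist:
        return False  # known fictional name
    # Reject names containing digits or other unexpected characters.
    return all(ch.isalpha() or ch in ALLOWED_PUNCTUATION for ch in name)


fictional = frozenset({"Sherlock Holmes"})  # hypothetical blacklist entry
detected = ["R2-D2", "Anne-Marie", "Al", "Sherlock Holmes", "J.K. Rowling"]
kept = [name for name in detected if keep_entity(name, blacklist=fictional)]
```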

Further optionally, similar named entities may be merged to handle cases in which a person (or another type of entity) is mentioned differently in various portions of the document 202 purely as a matter of writing style, rather than because the mentions refer to different people. For example, if the document 202 intermittently mentions portions of a person's full name, all of these partial mentions may be converted to the person's full name. For example, if the document 202 variously refers to "JK Rowling", "Joanne Rowling", "Joanne", "Joanne R.", and "J.K. Rowling", merging may include replacing each of these mentions with "Joanne Rowling", or with whichever of the mentions has characteristics previously defined as more desirable (e.g., length, number of words, etc.). This merging should not be confused with the technique for inferring the full name of a person-type entity discussed below with reference to FIG. 4. The merging applies to the case where the document 202 does mention the full name of the person concerned at least once, and elsewhere mentions him or her by a partial name. The technique of FIG. 4 relates to the case where only implicit, partial mentions of the person are available in a certain document.
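The replacement part of the merging can be sketched as follows, assuming the mentions that refer to the same person have already been grouped (the grouping itself is not shown). The function name and the preference rule, most words first and then greatest length, are assumptions chosen to match the characteristics the text mentions.

```python
import re


def merge_mentions(text, mentions):
    """Replace every mention in the group with the preferred (canonical) mention."""
    # Preference rule (an assumption): most words first, then the longest string.
    canonical = max(mentions, key=lambda m: (len(m.split()), len(m)))
    # Try longer mentions first so that "Joanne" cannot match inside "Joanne Rowling".
    alternatives = sorted(mentions, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(m) for m in alternatives) + r")(?!\w)"
    )
    return pattern.sub(lambda _match: canonical, text)


mentions = ["JK Rowling", "Joanne Rowling", "Joanne", "Joanne R.", "J.K. Rowling"]
merged = merge_mentions(
    "JK Rowling wrote it. Joanne said that Joanne R. agreed with J.K. Rowling.",
    mentions,
)
```

Because the substitution is done in a single pass, text that has already been replaced is not scanned again.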

When step 206 ends, it provides an output of the detected named entities, each given as a combination of a class and a name. Optionally, the output further includes an annotation of the location of each named entity within the sentence in which it was found. For example, the output associated with the sentence "Pavlova always eats Pavlova after her performances" might include the following 3-tuples: (Pavlova, person, 1-1) and (Pavlova, food, 4-4), indicating that the person-type entity Pavlova spans word index positions 1 to 1 of the sentence (i.e., it is a single word), while the food-type entity Pavlova spans positions 4 to 4 of the sentence. Other types of indices, such as character-based indices, may of course be used for the annotation. The location annotation may also include the location of the sentence itself in the document 202, if desired. If no named entity is detected at all, or if only non-person-type named entities are detected, the method 200 may terminate because the document 202 is unlikely to contain personal information.
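The output format described above can be illustrated with a small sketch. The per-word labels are assumed to come from some NER algorithm and are hand-written here; `EntityMention`, `to_mentions`, and `may_contain_personal_info` are hypothetical names. As a simplification, two adjacent entities of the same class would be merged into one span.

```python
from typing import NamedTuple, Optional


class EntityMention(NamedTuple):
    name: str
    entity_class: str
    start: int  # 1-based index of the first word of the mention
    end: int    # 1-based index of the last word of the mention


def to_mentions(words, labels):
    """Group consecutive words that carry the same class label into mentions."""
    mentions = []
    start: Optional[int] = None
    for i, label in enumerate(list(labels) + [None]):  # sentinel closes a trailing span
        if start is not None and label != labels[start]:
            mentions.append(EntityMention(" ".join(words[start:i]), labels[start], start + 1, i))
            start = None
        if start is None and label is not None:
            start = i
    return mentions


def may_contain_personal_info(mentions):
    """The method may terminate early when no person-type entity was detected."""
    return any(m.entity_class == "person" for m in mentions)


words = "Pavlova always eats Pavlova after her performances".split()
labels = ["person", None, None, "food", None, None, None]  # hand-written NER output
mentions = to_mentions(words, labels)
```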

In step 208, relationships between the named entities output by step 206 may be detected. This may involve applying a POS tagging (sometimes referred to as "grammatical tagging" or "word-category disambiguation") algorithm and a dependency parsing algorithm to at least those sentences of the document 202 that each contain at least two detected named entities. Alternatively, a single algorithm that performs both functions may be used.

The application of the POS tagging algorithm may tag the words in a sentence with their grammatical parts of speech (e.g., adjectives, prepositions, adverbs, conjunctions, articles, nouns, subtense, pronouns, verbs, etc.). The POS tagging algorithm may also tag multi-word expressions that collectively constitute a part of speech, such as compound nouns, compound adverbs, and the like.

Application of the dependency parsing algorithm may detect syntactic dependencies (sometimes referred to as syntactic "relations") between the words of a sentence. Each dependency may be between a word commonly referred to as the "head" and another word that depends on and modifies that head. Some exemplary relations that may be detected by common dependency parsing algorithms include "root", "clausal modifier of noun" (often abbreviated as "acl"), "clausal complement" (often abbreviated as "ccomp"), and many others.

The exemplary software packages mentioned above, Natural Language Understanding, spaCy, and NLTK, include POS tagging and dependency parsing algorithms that may be used in step 208.

To determine the relationships between the named entities, the results of the POS tagging and the dependency parsing may be used as follows:

The syntactic dependencies detected in each sentence can be traversed to determine a dependency path in the sentence that connects each pair of named entities (e.g., a pair of person-type entities, or one person-type entity and one non-person-type entity). That is, a pair of named entities may be connected not directly by a single syntactic dependency but by a series of words with pairwise dependencies between them, which together form an indirect connection between the two named entities.

A textual expression (having one or more words) located within the dependency path may then be selected based on the POS tags of the words in the path. For example, the selected textual expression may have a first word that is a verb or an adjective, and a second word that is not a named entity and is syntactically dependent on the first word (e.g., an expression such as "works at"). As another example, in the textual expression "does not work for", the first word is a verb, the second is a negation word, and the third and fourth form a prepositional verb. Another example is an expression such as "break out of", which is a phrasal verb that can be detected by matching against a dictionary of phrasal verbs.
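The two steps above can be sketched on a hand-written parse. The sentence "Dana works at Acme", its tags, and its head indices are assumptions made for illustration (they are not the output of any particular parser), and `Token`, `dependency_path`, and `select_expression` are hypothetical names. The path is found with a breadth-first search that treats the dependency tree as an undirected graph.

```python
from collections import deque
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    pos: str         # Penn Treebank-style tag, e.g. "VBZ"
    head: int        # index of the syntactic head; the root points to itself
    is_entity: bool  # True if the token belongs to a detected named entity


def dependency_path(tokens, source, target):
    """Return the token indices on the dependency path from source to target."""
    neighbours = {i: set() for i in range(len(tokens))}
    for i, tok in enumerate(tokens):
        if tok.head != i:
            neighbours[i].add(tok.head)
            neighbours[tok.head].add(i)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbours[node]):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return []


def select_expression(tokens, path):
    """First verb or adjective inside the path, plus non-entity path words that depend on it."""
    inner = path[1:-1]
    for i in inner:
        if tokens[i].pos.startswith(("VB", "JJ")):
            dependents = [j for j in inner
                          if j != i and tokens[j].head == i and not tokens[j].is_entity]
            return " ".join(tokens[k].text for k in sorted([i] + dependents))
    return ""


# Hand-written parse of "Dana works at Acme" (an assumption, not parser output).
tokens = [
    Token("Dana", "NNP", 1, True),
    Token("works", "VBZ", 1, False),
    Token("at", "IN", 1, False),
    Token("Acme", "NNP", 2, True),
]
path = dependency_path(tokens, 0, 3)
expression = select_expression(tokens, path)
```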

For example, consider the sentence "J.K. Rowling earned approximately $54 million last year", whose NER, POS tagging, and dependency parsing results are shown in FIG. 3. The NER algorithm detects the person-type named entity "J.K. Rowling" and the money-type named entity "$54 million". It also detects "last year" as a date-type named entity, but that entity is not of interest in this example.

The dependency parsing algorithm establishes that these two entities are indirectly connected through two syntactic dependencies: "$54 million" is the direct object (dobj) of "earned", and "J.K. Rowling" is the nominal subject (nsubj) of "earned".

The path between the two named entities of interest consists of the words "earned approximately". To select a textual expression from the path, it is first determined that "earned" is a verb (VBD) and should therefore be the first word of the expression. The word "approximately" is not a recognized named entity and depends on "earned" (indirectly, through "$"), so it should be the second word of the expression. Thus, the selected textual expression is "earned approximately". In this example, all the words in the path were selected for the textual expression, but in practice the path may include one or more words that are not selected as part of the expression. For example, in "earned approximately", the second word may be ignored because it contributes little to the personal information at issue, namely how much money a person earns.

Reference is now made back to fig. 2. When the method 200 is applied to many different documents, it may yield a large number of differently worded textual expressions. It may therefore be advantageous to express the relationships of step 208 using a smaller number of options. To that end, a predefined list of relationship types may be provided, and the similarity of each selected textual expression to these types may be analyzed in order to associate the expression with the most similar type. As is known in the art, the similarity analysis may utilize NLP techniques to determine lexical and/or semantic similarity between the textual expressions and the predefined relationship types.

The predefined relationship types may include types that are likely to be associated with personal information, namely types that suggest an association between a person-type named entity and information such as a national identification number, social security number, driver's license number, date and place of birth, mother's maiden name, residential address, Internet Protocol (IP) address, email address, phone number, medical condition, educational background, financial data, employment records, and the like. By way of example only, textual expressions such as "work at", "serves as a <role>", and "was hired by" may be associated with a predefined relationship type representing employment data of a person, such as the relationship type "employed by".
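As a rough illustration of associating a textual expression with the most similar predefined relationship type, the sketch below uses purely lexical similarity from Python's standard `difflib` module. The relationship types, their seed phrases, and the 0.6 threshold are hypothetical choices made for the example; a practical system might instead use semantic similarity.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical relationship types, each with a few seed phrases (assumptions).
RELATION_TYPES = {
    "employed by": ["works at", "was hired by", "serves as"],
    "resides in": ["lives in", "resides in"],
    "married to": ["is married to", "married"],
}

def classify_expression(expression: str, threshold: float = 0.6) -> Optional[Tuple[str, float]]:
    """Return (relationship type, similarity) for the closest type, or None."""
    best_type, best_score = None, 0.0
    for rel_type, seeds in RELATION_TYPES.items():
        for seed in seeds:
            score = SequenceMatcher(None, expression.lower(), seed).ratio()
            if score > best_score:
                best_type, best_score = rel_type, score
    if best_type is None or best_score < threshold:
        return None  # no predefined type is similar enough
    return best_type, best_score
```

Under these assumptions, an expression such as "worked at" is associated with "employed by", while an unrelated string falls below the threshold and is left unclassified.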

In step 210, it may be estimated whether each relationship between a person-type entity and another person-type entity or a non-person-type entity indicates personal information, which may be revealed by: a person-type entity and a relationship (e.g., "Andrey is married"); or a person-type entity, a relationship, and another entity (e.g., "Andrey is married to Orli" or "Andrey is a computer scientist"). The estimation may be based on one or more factors, such as a set of predefined scores associated with the relationship types (determined in step 208) and/or a set of predefined scores associated with the non-person entity types (detected in step 206). Each of these scores indicates the likelihood that a relationship type or a non-person entity type, respectively, reveals or belongs to personal information. For example, a relationship type indicating that a person has a certain medical condition may receive a relatively high score, while a relationship type indicating that a person lives in a certain country may receive a relatively low score. With respect to non-person entity types, for example, an "illness" type may receive a relatively high score, while an "occupation" type may receive a relatively low score.

For each instance in the document in which a person-type entity, a relationship, and a person-type or non-person-type entity appear in sequence, the relationship type score and the non-person entity type score may be combined into a unified score. The combination may be a simple addition, a multiplication, or some other logic that amplifies the effect of either the relationship type score or the non-person entity type score, according to user preference.

To calculate a privacy score for each person mentioned in the document 202, the unified scores associated with the instances of that person-type named entity may be aggregated, such as by simple addition or by some more complex logic, according to user preference. Thus, a higher privacy score represents an estimate that the document 202 includes a large amount of personal information about the individual, while a lower privacy score represents the opposite estimate.

It is also possible to calculate a privacy score for each document 202 as a whole, to indicate the estimated amount of personal information it includes. This may be performed by aggregating the privacy scores computed for all person-type named entities appearing in the document. The aggregation may be done by simple addition or by some more complex logic, according to user preference.
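The scoring scheme described above can be illustrated with a minimal sketch. The score tables and their values are invented for the example, and multiplication and simple addition are only two of the combination and aggregation options mentioned above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical predefined scores (assumptions for illustration only).
RELATION_SCORES = {"has condition": 0.9, "resides in": 0.2, "employed by": 0.4}
ENTITY_TYPE_SCORES = {"illness": 0.9, "location": 0.3, "organization": 0.3}

# One instance = (person name, relationship type, non-person entity type).
Instance = Tuple[str, str, str]

def unified_score(relation_type: str, entity_type: str) -> float:
    """Combine the two predefined scores; multiplication is one possible choice."""
    return RELATION_SCORES[relation_type] * ENTITY_TYPE_SCORES[entity_type]

def person_privacy_scores(instances: List[Instance]) -> Dict[str, float]:
    """Aggregate unified scores per person by simple addition."""
    totals: Dict[str, float] = defaultdict(float)
    for person, relation_type, entity_type in instances:
        totals[person] += unified_score(relation_type, entity_type)
    return dict(totals)

def document_privacy_score(instances: List[Instance]) -> float:
    """Aggregate the per-person scores into one score for the whole document."""
    return sum(person_privacy_scores(instances).values())
```

Swapping the multiplication for a weighted sum, or the additions for some other aggregation, changes only the bodies of `unified_score` and the two aggregation functions.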

In step 212, a notification of the estimation result may be issued. This may include, for example, displaying the privacy score of each person and/or the privacy score of each document on a computer display, sending the scores in an electronic message, adding the scores as attributes to the document 202 (i.e., to the digital file storing the document), recording the scores in a database that stores documents and privacy score data, and so forth. Optionally, based on the notification, the document 202 may be manually or automatically transferred to a storage location that complies with the relevant regulation. For example, if it is determined (e.g., by the privacy score exceeding a predefined threshold) that the document 202 includes personal information of a person-type entity residing in the European Union (or otherwise subject to European Union regulations, such as GDPR), the document may be transferred to physical storage located within the European Union. The same reasoning applies to other jurisdictions and regulations. Conversely, if it is determined that the document 202 contains no personal information, either at all or with respect to a certain regulation/jurisdiction, the document may be transferred to physical storage chosen on the basis of various technical and/or business considerations, without regard to that regulation.
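The threshold-based transfer decision can be sketched as below. The storage location names, the jurisdiction codes, and the default threshold are hypothetical placeholders, not values prescribed by the method.

```python
# Hypothetical mapping from jurisdiction to a compliant storage location.
REGIONAL_STORAGE = {"EU": "eu-datacenter"}

def choose_storage(privacy_score: float, jurisdiction: str,
                   threshold: float = 1.0,
                   default: str = "lowest-cost-site") -> str:
    """Return a compliant location when the score exceeds the threshold."""
    if privacy_score > threshold and jurisdiction in REGIONAL_STORAGE:
        return REGIONAL_STORAGE[jurisdiction]
    # No personal information detected (or no regional rule applies):
    # the location may be chosen on technical or business grounds.
    return default
```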

Referring now to fig. 4, a flow diagram is shown of a method 300 for inferring the full names of person-type entities that are only implicitly referenced in a digital text document, according to an embodiment. The method 300 may sometimes fail to infer the full name of an implicitly referenced person-type entity, but may still narrow the answer down to a small group of people among whom a match can be found, possibly with additional manual effort. The steps of method 300 may be part of the instructions of personal information detection module 108 of fig. 1.

The steps of method 300 may be performed in the order in which they appear, or may be performed in a different order (even in parallel), as long as the order allows the necessary input for a step to be retrieved from the output of an earlier step. Additionally, the steps of method 300 are performed automatically (e.g., by system 100 of fig. 1), unless specifically noted otherwise.

The method 300 may complement the method 200 of fig. 2 in cases where the document 202 references a person-type entity only implicitly, for example by only one of a first name, last name, nickname, acronym, etc. (hereinafter referred to as a "partial" person name). In such cases, the output of the method 300 (i.e., the inferred full name) is used to enhance the output of the method 200 (i.e., the personal information estimates about one or more person-type entities).

First, in method 300, a training set 302 is received, the training set including a plurality of digital text documents (hereinafter "documents"), including full names.

In an optional preprocessing step 304, the text in the training set 302 can be analyzed and refined so that subsequent applications of NLP techniques can obtain more accurate results. Step 304 is essentially similar to step 204 of method 200, except for the fact that: various preprocessing operations will be applied to multiple documents rather than one document. Accordingly, the description of step 204 of method 200 applies mutatis mutandis.

In step 306, the NER algorithm may be applied to the training set 302 to detect a plurality of person-type entities and a plurality of non-person-type entities. These documents may be documents obtained from the same organization or same domain as document 202 of fig. 2, so at least one of these documents (and even better, a large number of documents) may refer to the implicit personal-type entity in its full name. For example, documents may be retrieved from a data store of a human resources department of an organization, and therefore, they should contain personal information about many employees and mention most or all of them in their full name.

Step 306 may be similar in nature to step 206 of method 200, except for the fact that: the NER algorithm applies to multiple documents rather than one document. Accordingly, the description of step 206 of method 200 applies mutatis mutandis.

When step 306 ends, it provides output for a plurality of person-type entities and a plurality of non-person-type entities, each entity being given as a combination of class and name.

In step 308, relationships between the person-type entities themselves, or between the person-type entities and the non-person-type entities (all such relationships are collectively referred to as "interrelationships"), may be detected by applying the POS tagging algorithm and the dependency parsing algorithm to those sentences of the plurality of documents that contain a pair of entities, the pair consisting of either two person-type entities or one person-type entity and one non-person-type entity.

Step 308 may be similar in nature to step 208 of method 200, except for the fact that: POS tagging algorithms and dependency resolution algorithms are applied to a large number of sentences obtained from multiple documents. Accordingly, the description of step 208 of method 200 applies here mutatis mutandis.

In step 310, a training knowledge graph including nodes and edges may be generated. The nodes of the training knowledge graph include a plurality of person-type entities and those plurality of non-person-type entities that are related to any of the plurality of person-type entities. The edges of the training knowledge graph, in turn, include interrelationships of a plurality of person-type entities and a plurality of non-person-type entities. Thus, the training knowledge graph describes detected person-type-to-person-type relationships and/or person-type-to-non-person-type relationships that appear in a plurality of documents. The generation of the training knowledge graph may utilize any suitable knowledge graph generation software known in the art configured to receive as input the definitions of nodes and edges and output the graph. Notably, the term "graph" does not imply that a graph needs to be presented graphically; rather, the graph may exist purely as computer code characterizing the graph, such as extensible markup language (XML) code or code of any other suitable programming or markup language.

The training knowledge graph generated in step 310 may be stored in non-volatile memory for future use, for example, when a partial person name is detected in a document being analyzed by method 200 of FIG. 2.
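A minimal sketch of such a graph, held purely as data and serialized to XML for storage, is shown below. The triple format, the function names, and the XML element names are assumptions made for illustration; any knowledge graph software could be used instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple
import xml.etree.ElementTree as ET

Triple = Tuple[str, str, str]          # (person, relationship type, related entity)
Graph = Dict[str, Set[Tuple[str, str]]]

def build_graph(triples: Iterable[Triple]) -> Graph:
    """Nodes are entities; each edge is stored as (relationship, target) on the person node."""
    graph = defaultdict(set)
    for person, relation, entity in triples:
        graph[person].add((relation, entity))
    return dict(graph)

def to_xml(graph: Graph) -> str:
    """Serialize the graph as XML text (element names are hypothetical)."""
    root = ET.Element("graph")
    for person in sorted(graph):
        node = ET.SubElement(root, "node", name=person)
        for relation, entity in sorted(graph[person]):
            ET.SubElement(node, "edge", relation=relation, target=entity)
    return ET.tostring(root, encoding="unicode")

# Illustrative triples, as they might be produced by steps 306 and 308.
TRAINING_TRIPLES = [
    ("Andrey Finkelshtein", "resides in", "Israel"),
    ("Andrey Finkelshtein", "employed by", "IBM"),
    ("Andrey Finkelshtein", "married to", "Orli"),
    ("Eitan Menahem", "employed by", "IBM"),
]
```

The string returned by `to_xml(build_graph(TRAINING_TRIPLES))` can be written to non-volatile storage and parsed back later with `ET.fromstring`.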

Accordingly, in step 312, an indication is received that a partial name was detected. This detection may be performed, for example, during or after execution of step 206 of method 200, when the output of at least one person-type entity is provided. The name of the person-type entity is analyzed to determine whether it is a full name or a partial name, for example, by checking the name against a set of rules that define what is considered a full name and/or a partial name. Such rules may specify, for example, that a partial name is a name that satisfies one or more of the following conditions (a full name being any name that is not a partial name): it contains only a single word (e.g., "Jean" or "Picard"); it contains only two words connected by a hyphen (e.g., "Jean-Luc"); or it consists of either of the first two options plus one or more single letters (e.g., "J. Picard", "J.L. Picard") or plus one or more words written entirely in upper case (e.g., "JL Picard"). Other possible rules will become apparent to those skilled in the art. The rules may be implemented, for example, as regular expressions (RegEx), as is known in the art.
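One possible regular-expression encoding of the example rules above is sketched here. The pattern is an assumption made for illustration: it handles only ASCII letters and expects initials or upper-case words to precede the name word.

```python
import re

# Optional prefix: initials with periods ("J.", "J.L.") or upper-case words ("JL"),
# followed by a single word or a hyphenated pair ("Picard", "Jean-Luc").
PARTIAL_NAME = re.compile(
    r"(?:(?:[A-Z]\.\s*)+|(?:[A-Z]+\s+)+)?"   # initials or all-upper-case words
    r"[A-Za-z]+(?:-[A-Za-z]+)?"              # one word, optionally hyphenated
)

def is_partial_name(name: str) -> bool:
    """True if the name matches the partial-name rules; otherwise it is treated as full."""
    return PARTIAL_NAME.fullmatch(name.strip()) is not None
```

Under this pattern, "Jean", "Jean-Luc", "J. Picard", "J.L. Picard", and "JL Picard" are classified as partial names, while "Jean-Luc Picard" is classified as a full name.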

In step 314, a particular knowledge graph may be generated for the document 202 of the method 200. The nodes of the particular knowledge graph include the at least one person-type entity of step 206 of method 200 and those of the at least one (typically, a plurality of) non-person-type entities of step 206 that are related to any of the at least one person-type entity. Additionally or alternatively, the nodes of the particular knowledge graph include multiple person-type entities of step 206 of method 200 that are related to each other. The edges of the particular knowledge graph, in turn, include the relationships between pairs of entities, whether person-to-person or person-to-non-person relationships, determined in step 208 of method 200. Thus, the particular knowledge graph describes all detected relationships between the at least one person-type entity and (a) another of the at least one person-type entity or (b) the at least one non-person-type entity of document 202 of method 200.

Since, statistically, one or more of the plurality of documents forming the training set are likely to include information about the at least one person-type entity mentioned in the document 202 of the method 200, the information contained in the document 202 and in the training set may be cross-referenced to infer the full name of the person who is only implicitly referenced in the document 202.

Thus, in step 316, at least one full name of at least one person-type entity of the document 202 of the method 200 is determined, for example, by cross-referencing the particular knowledge graph with the training knowledge graph.

Alternatively, if a single full name of the at least one person-type entity of the document 202 of the method 200 cannot be determined (e.g., because the training set does not include sufficiently conclusive information), step 316 may determine a group of possible full names to which the at least one person-type entity may belong, or a group defined not by full names but by some other characteristic, such as all employees of a particular branch of an organization.

The cross-referencing of step 316 may include, for example, one or more graph matching techniques known in the art. Graph matching techniques typically operate by searching a larger graph (in our case, the training knowledge graph) for the subgraph that has the highest similarity to some smaller graph (here, the particular knowledge graph). An example of a suitable graph matching technique is the one proposed in Cordella, L.P., Foggia, P., and Sansone, C. (2001), "An improved algorithm for matching large graphs," 3rd IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, pp. 149-159.

Alternatively, the cross-referencing may be represented as a Boolean satisfiability problem ("SAT"). As is known in the art, this may be performed by creating a separate SAT formula (also referred to as an "expression") for each person-type entity of the particular knowledge graph. First, the set, of size n, of all entities related to the corresponding person-type entity in the particular knowledge graph is found. For example, if "Andrey" is a person-type entity, and he is related to: (a) the "Israel" location-type entity by a "resides in" relationship, (b) the "IBM" organization-type entity by an "employed by" relationship, and (c) the "Orli" person-type entity by a "married to" relationship, then the set of entities is {Israel, IBM, Orli} (n = 3).

Next, for each subset of the entity set (i.e., each of the 2ⁿ − 1 possible combinations, excluding the empty set), a rule is created by placing a logical AND condition between all entities in the subset. All of these rules are then joined by logical OR conditions to create the SAT formula for the person-type entity.

Continuing with the previous example, a SAT formula with seven rules will be created:

[ (X resides in Israel) AND (X is employed by IBM) AND (X is married to Orli) ]

OR [ (X is employed by IBM) AND (X is married to Orli) ]

OR [ (X resides in Israel) AND (X is married to Orli) ]

OR [ (X resides in Israel) AND (X is employed by IBM) ]

OR (X resides in Israel)

OR (X is employed by IBM)

OR (X is married to Orli)

If the plurality of documents includes a large amount of information about various person-type entities, represented by many relationships between each of these person-type entities and non-person-type entities, the SAT formula for each person-type entity may become lengthy and contain many rules. It may therefore be desirable to reduce the number of rules per formula, for example, by limiting the size of each possible subset to some predetermined range. For example, if the size is limited to between 2 and 5, then rules 5-7 in the above example would not be created. While this means that not all possibilities are explored, the resulting formula may still be sufficient to reliably associate partial names in the particular knowledge graph with full names in the training knowledge graph.
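The rule generation described above, including the optional limit on subset size, can be sketched with `itertools.combinations`. Representing each condition as a (relationship, entity) pair and each rule as a tuple of conditions is an assumption made for the example.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple

Condition = Tuple[str, str]     # (relationship type, related entity)
Rule = Tuple[Condition, ...]    # conditions joined by logical AND

def build_rules(conditions: Sequence[Condition],
                min_size: int = 1,
                max_size: Optional[int] = None) -> List[Rule]:
    """Enumerate every non-empty subset of conditions within the size limits.

    The returned rules are understood to be joined by logical OR.
    Larger rules come first, mirroring the order of the example formula.
    """
    n = len(conditions)
    max_size = n if max_size is None else min(max_size, n)
    rules: List[Rule] = []
    for size in range(max_size, min_size - 1, -1):
        rules.extend(combinations(conditions, size))
    return rules

ANDREY_CONDITIONS = [
    ("resides in", "Israel"),
    ("employed by", "IBM"),
    ("married to", "Orli"),
]
```

For n = 3 conditions this yields 2³ − 1 = 7 rules without a size limit, and 4 rules when the subset size is limited to between 2 and 5.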

After the SAT formulas are created, the SAT problem may be solved, as is known in the art, by evaluating the satisfiability of each formula with the various full names from the training knowledge graph. For example, if the training knowledge graph includes the full names of the following three person-type entities: "Andrey Finkelshtein", "Eitan Menahem", and "Bar Haim", then the above seven-rule formula is populated with each of these three names, and its conditions are checked against the training knowledge graph. Even if, according to the nodes and edges of the training knowledge graph, all three satisfy the conditions "is employed by IBM" and "resides in Israel", only "Andrey Finkelshtein" satisfies the condition "is married to Orli". Thus, "Andrey Finkelshtein" will accumulate more TRUE rule checks (7) than "Eitan Menahem" and "Bar Haim" (3 each).

Accordingly, determining that "Andrey Finkelshtein" is the full name of the person-type entity that appears only as a partial name ("Andrey") in document 202 of method 200 may be based on a voting process, in which the person-type entity with the greatest number of TRUE rule checks is selected. Optionally, rules with more conditions may be given more weight, to amplify their impact on the final count of TRUE checks. For example, each rule may be given a weight equal to its number of conditions. In the example above, rule 1 would get a weight of 3, rules 2-4 would each get a weight of 2, and rules 5-7 would each get a weight of 1. This would result in "Andrey Finkelshtein" receiving 3+2+2+2+1+1+1 = 12 votes, while "Eitan Menahem" and "Bar Haim" would each receive only 2+1+1 = 4 votes. Weighting does not affect the voting result of the simple three-entity example given here, but it may be useful when the training knowledge graph includes many nodes and many edges, and multiple full names may satisfy the rules that have a small number of conditions.
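The voting process can be reproduced with a short sketch. Note that it simply evaluates each rule against the facts recorded for each full name rather than invoking a general-purpose SAT solver; the data layout and names are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, Sequence, Set, Tuple

Condition = Tuple[str, str]   # (relationship type, related entity)

# Facts per full name, as they might be read from the training knowledge graph.
TRAINING: Dict[str, Set[Condition]] = {
    "Andrey Finkelshtein": {("resides in", "Israel"), ("employed by", "IBM"),
                            ("married to", "Orli")},
    "Eitan Menahem": {("resides in", "Israel"), ("employed by", "IBM")},
    "Bar Haim": {("resides in", "Israel"), ("employed by", "IBM")},
}

# Conditions of the partial name "Andrey" from the particular knowledge graph.
CONDITIONS = [("resides in", "Israel"), ("employed by", "IBM"), ("married to", "Orli")]

def vote(conditions: Sequence[Condition],
         training: Dict[str, Set[Condition]],
         weighted: bool = False) -> Dict[str, int]:
    """Count TRUE rule checks per full name; optionally weight by rule size."""
    rules = [rule
             for size in range(1, len(conditions) + 1)
             for rule in combinations(conditions, size)]
    tally = {}
    for name, facts in training.items():
        # A rule is TRUE when every one of its AND-ed conditions is a known fact.
        tally[name] = sum((len(rule) if weighted else 1)
                          for rule in rules if set(rule) <= facts)
    return tally
```

Here `vote(CONDITIONS, TRAINING)` gives 7 for "Andrey Finkelshtein" and 3 for each of the others, and `vote(CONDITIONS, TRAINING, weighted=True)` gives 12 and 4, in agreement with the figures above.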

Whether the cross-referencing employs the SAT approach or the graph matching approach, it may result in multiple full name matches for the person-type entity. To determine which of these matches is likely to be correct, one or more heuristics may be applied. An exemplary heuristic is to apply a string similarity algorithm known in the art (e.g., the Levenshtein distance algorithm) to the partial name and to one word of each full name (possibly the person's first or last name). For example, assuming all three full names in the previous example were determined to match some partial name "Andre" (e.g., because all three happened to be married to a person named Orli), the Levenshtein distance algorithm may be applied to "Andre" and the first word of each full name: "Andrey" (distance equal to 1), "Eitan" (distance equal to 5), and "Bar" (distance equal to 4). This means that "Andrey" is the closest match of the three, and that "Andre" may be a misspelling of, or a nickname for, Andrey.
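The Levenshtein heuristic can be illustrated with a standard dynamic-programming implementation. Comparing the partial name against every word of each full name and keeping the smallest distance is an assumption made for this sketch; the comparison here is case-sensitive.

```python
from typing import Iterable

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def closest_full_name(partial: str, full_names: Iterable[str]) -> str:
    """Pick the full name containing the word closest to the partial name."""
    return min(full_names,
               key=lambda full: min(levenshtein(partial, word)
                                    for word in full.split()))
```

For the partial name "Andre", the distances to "Andrey", "Eitan", and "Bar" are 1, 5, and 4, so "Andrey Finkelshtein" is selected among the three candidates.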

As described above, in some cases the training set may simply not contain enough information to single out one full name that matches the partial name of the person-type entity. Thus, if the cross-referencing produces multiple full name matches, an upper threshold (T) may be applied to the matches, such that only the T best-matching full names ("best" as ranked by the output of the SAT solver or the graph matcher) are output and presented to the user.

To provide an even more meaningful output to the user, the previously detected relationships of the multiple full names may be analyzed, and any relationship that characterizes all or most of these full names may also be output. For example, all of these full names may share the same relationship, indicating that they are employed at IBM's Haifa research laboratory. This can be a very useful output, since it means that a certain document contains personal information about some employee of IBM's Haifa research laboratory, even if his or her full name cannot be identified.
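The threshold T and the shared-relationship output can be sketched as follows. The vote tally and the facts per name are illustrative inputs, and breaking ties alphabetically is an arbitrary choice made for the example.

```python
from typing import Dict, List, Set, Tuple

Condition = Tuple[str, str]   # (relationship type, related entity)

def top_matches(tally: Dict[str, int], t: int) -> List[str]:
    """Return the T best-scoring full names (ties broken alphabetically)."""
    ranked = sorted(tally.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:t]]

def shared_relations(names: List[str],
                     training: Dict[str, Set[Condition]]) -> Set[Condition]:
    """Relationships common to all of the given full names."""
    fact_sets = [training[name] for name in names]
    return set.intersection(*fact_sets) if fact_sets else set()
```

If the retained full names all share, for example, the ("employed by", "IBM") relationship, that relationship can be reported to the user even when no single full name can be identified.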

The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.

The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

Computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The description of a range of values should be considered to have specifically disclosed all the possible subranges as well as individual values within that range. For example, description of a range of 1 to 6 should be considered to have explicitly disclosed some sub-ranges, such as 1 to 3, 1 to 4, 1 to 5, 2 to 4, 2 to 6, 3 to 6, etc., as well as individual numbers within that range, such as 1, 2, 3, 4, 5, and 6. Regardless of the breadth of the range, this applies.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method comprising operating at least one hardware processor to:

automatically applying a Named Entity Recognition (NER) algorithm to a digital text document to detect a named entity appearing in the digital text document, wherein the named entity is selected from the group consisting of: at least one person-type entity, and at least one non-person-type entity;

automatically detecting at least one relationship between the named entities by applying a part-of-speech (POS) tagging algorithm and a dependency parsing algorithm to sentences of the digital text document that contain the detected named entities;

automatically estimating whether the at least one relationship between the named entities represents personal information; and

automatically issuing a notification of a result of the estimation.

2. The method of claim 1, further comprising operating at least one hardware processor to replace, in the digital text document, pronouns that refer to the at least one person-type entity with the name of the at least one person-type entity.

3. The method of claim 1, further comprising operating at least one hardware processor to:

automatically pre-processing the digital text document prior to automatically applying the NER algorithm by at least one of:

(a) detecting a primary language of the digital text document, thereby selecting an NER algorithm to match the primary language;

(b) removing from the digital text document at least one of: whitespace and technical characters; and

(c) correcting spelling errors in the digital text document.

4. The method of claim 1, wherein the at least one non-person type entity is selected from the group consisting of: organization, object, location, nationality, time, date, address, artwork, event, marital status, occupation, money, language, and quantity.

5. The method of claim 1, further comprising operating at least one hardware processor to: automatically applying a different Named Entity Recognition (NER) algorithm to the digital text document; and applying one or more predefined rules to resolve one or more conflicts between named entities detected by the NER algorithm and a different NER algorithm.

6. The method of claim 1, further comprising operating at least one hardware processor to filter the named entities and merge at least some of the named entities.

7. The method of claim 1, wherein the automatically detecting at least one relationship between the named entities further comprises:

determining a dependency path connecting every two named entities in each sentence using the results of the applied dependency resolution algorithm;

selecting a textual expression located within the dependency path; and

associating each of the textual expressions with a relationship type selected from a predefined set of relationship types.

8. The method of claim 7, wherein the automatically estimating comprises calculating a privacy score for the digital text document or a privacy score for each of the at least one person-type entity based on:

a first set of predefined scores associated with the relationship types, wherein each score of the first set indicates a likelihood that the respective relationship type is part of personal information; and

a second set of predefined scores associated with the named entities, wherein each score of the second set indicates a likelihood that the respective named entity is part of personal information.

9. The method of claim 7, further comprising operating at least one hardware processor to:

automatically detect that the at least one person-type entity includes at least a portion of a person name;

automatically apply an NER algorithm to a training set of a plurality of other digital text documents containing full names, to detect a plurality of person-type entities and a plurality of non-person-type entities;

automatically detect relationships between the plurality of person-type entities and the plurality of non-person-type entities by applying a part-of-speech (POS) tagging algorithm and a dependency parsing algorithm to sentences of the plurality of other digital text documents, each sentence containing at least two named entities of the plurality of person-type entities and the plurality of non-person-type entities;

automatically generate a training knowledge graph whose nodes include those of the plurality of person-type entities and the plurality of non-person-type entities that are associated with each other, and whose edges include the respective relationships;

automatically generate a particular knowledge graph whose nodes include the at least one person-type entity and the at least one non-person-type entity that are associated with each other, and whose edges include the respective at least one relationship; and

automatically determine at least one full name of the at least one person-type entity by cross-referencing the particular knowledge graph and the training knowledge graph.

10. The method of claim 9, wherein the cross-referencing is based on at least one of: a graph matching technique, and a Boolean satisfiability problem (SAT) representation and solving technique.
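The cross-referencing of claims 9 and 10 can be illustrated with a deliberately simplified stand-in for graph matching. Each knowledge graph is represented as a dictionary from a person node to its set of (relationship type, entity) edges, and a partial name is resolved to the training-graph full name whose edges overlap most (Jaccard similarity). This neighbourhood-overlap heuristic, the data layout, and the function name `resolve_full_name` are assumptions; the sketch is neither a full graph matching algorithm nor a SAT encoding.

```python
def resolve_full_name(partial, particular, training):
    """Return the training-graph full name whose edges best match those of `partial`.

    particular, training: dict mapping a person node to a set of
    (relation_type, entity) edges. Returns None when no candidate shares an edge.
    """
    edges = particular[partial]
    best_name, best_overlap = None, 0.0
    for full_name, known_edges in training.items():
        # Only consider full names that contain the partial name as a whole word.
        if partial.lower() not in full_name.lower().split():
            continue
        union = edges | known_edges
        overlap = len(edges & known_edges) / len(union) if union else 0.0
        if overlap > best_overlap:
            best_name, best_overlap = full_name, overlap
    return best_name
```

For example, a document mentioning only "Smith" as employed by Acme would resolve to a training-graph "John Smith" who is also linked to Acme, rather than to an "Anna Smith" linked to a different employer.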

11. A system, comprising:

(a) at least one hardware processor; and

(b) a non-transitory computer-readable storage medium having program code embodied thereon, the program code being executable by the at least one hardware processor to perform the steps embodied by the method of any of claims 1 to 9.

12. A computer program product comprising a computer readable hardware storage device storing computer readable program code, the computer readable program code comprising an algorithm, which when executed by a processor of a hardware controller, implements the steps comprised by the method of any one of claims 1 to 9.

13. An apparatus comprising one or more modules configured to implement the steps included in the method of any one of claims 1 to 9.