CN112329055A - Method and device for desensitizing user data, electronic equipment and storage medium - Google Patents

Publication number
CN112329055A
Authority
CN
China
Prior art keywords
sensitive information
information
text
desensitized
desensitization
Legal status
Pending
Application number
CN202011206902.2A
Other languages
Chinese (zh)
Inventor
王磊
刘磊
Current Assignee
Weiyiyun Hangzhou Holding Co ltd
Original Assignee
Weiyiyun Hangzhou Holding Co ltd
Application filed by Weiyiyun Hangzhou Holding Co ltd
Priority to CN202011206902.2A
Publication of CN112329055A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for desensitizing user data, an electronic device, and a computer-readable storage medium. The method comprises: acquiring a data desensitization request, which comprises a text to be desensitized and a plurality of desensitization information categories; for each desensitization information category, identifying the text to be desensitized according to the sensitive information funnel model corresponding to that category, to obtain the sensitive information positions in the text to be desensitized; and replacing the sensitive information at those positions with designated characters to obtain a desensitized text. Because the text to be desensitized is identified in a targeted manner by the sensitive information funnel model corresponding to each desensitization information category, data desensitization can be achieved accurately.

Description

Method and device for desensitizing user data, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for desensitizing user data, an electronic device, and a computer-readable storage medium.
Background
While using an internet platform, a user generates user data, which may include personal sensitive information. Personal sensitive information is personal information that can lead to damage to a person's reputation or physical and mental health, or to discriminatory treatment. According to the national standard GB/T 35273-2017, Information Security Technology: Personal Information Security Specification, personal sensitive information falls into several categories: personal property information, personal health and physiological information, personal biometric information, personal identity information, network identity information, and the like.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method and an apparatus for desensitizing user data, an electronic device, and a computer-readable storage medium, which identify a text to be desensitized in a targeted manner according to the sensitive information funnel models corresponding to different desensitization information categories, so that data desensitization can be achieved accurately.
In one aspect, the present application provides a method of user data desensitization, comprising:
acquiring a data desensitization request; the data desensitization request comprises a text to be desensitized and a plurality of desensitization information categories;
aiming at each desensitization information category, identifying the text to be desensitized according to a sensitive information funnel model corresponding to the desensitization information category to obtain a sensitive information position in the text to be desensitized;
and replacing sensitive information with designated characters in the sensitive information position of the text to be desensitized to obtain a desensitized text.
In one embodiment, the desensitization information category is an identification number or a bank card number;
the identifying the text to be desensitized according to the sensitive information funnel model corresponding to the desensitization information category to obtain the sensitive information position in the text to be desensitized comprises the following steps:
judging whether the text to be desensitized has suspected sensitive information according to the regular expression corresponding to the desensitization information category;
if yes, verifying the suspected sensitive information according to a verification rule corresponding to the desensitization information category;
and when the suspected sensitive information is verified to be sensitive information corresponding to the desensitization information category, acquiring the sensitive information position of the sensitive information in the text to be desensitized.
In one embodiment, the desensitization information category is a telephone number, a mobile phone number, or an email;
the identifying the text to be desensitized according to the sensitive information funnel model corresponding to the desensitization information category to obtain the sensitive information position in the text to be desensitized comprises the following steps:
judging whether sensitive information exists in the text to be desensitized according to the regular expression corresponding to the desensitization information category;
and if so, acquiring the position of the sensitive information in the text to be desensitized.
In one embodiment, the desensitization information category is a home address;
the identifying the text to be desensitized according to the sensitive information funnel model corresponding to the desensitization information category to obtain the sensitive information position in the text to be desensitized comprises the following steps:
identifying each level of address information in the text to be desensitized through a trained place matching model;
according to the identified sensitive information position of each address information, determining continuous address information combinations in the text to be desensitized;
aiming at each address information combination, calculating the total score of the address information combination according to the preset weight and the preset score of each level of address information in the address information combination;
judging whether the total score of the address information combination is greater than a preset total score threshold value or not;
if yes, determining that the address information combination is sensitive information, and acquiring the sensitive information position of the address information combination.
In one embodiment, the desensitization information category is name;
the identifying the text to be desensitized according to the sensitive information funnel model corresponding to the desensitization information category to obtain the sensitive information position in the text to be desensitized comprises the following steps:
identifying the text to be desensitized through a trained name matching model to obtain suspected sensitive information in the text to be desensitized;
filtering the suspected sensitive information through a designated domain dictionary and a preset suffix filtering rule to obtain sensitive information;
and acquiring the sensitive information position of the sensitive information.
In an embodiment, after obtaining the location of the sensitive information in the text to be desensitized, the method further includes:
and returning the sensitive information position and the desensitization information category corresponding to the sensitive information to the source of the data desensitization request.
In one embodiment, the replacing sensitive information with a specified character in the sensitive information position of the text to be desensitized includes:
judging whether the different sensitive information positions in the text to be desensitized are continuous or not;
if at least two sensitive information positions are continuous, combining the continuous sensitive information positions into one sensitive information position;
and replacing the sensitive information with the specified character at the position of the combined sensitive information.
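The merging step above can be sketched in Python as follows. This is only an illustration under stated assumptions: positions are represented as 0-based (start, end) offsets with an exclusive end, and the names `merge_spans` and `replace_spans` and the designated string "****" are hypothetical, not taken from the application.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # assumed representation: 0-based start, exclusive end


def merge_spans(spans: List[Span]) -> List[Span]:
    """Combine positions that touch or overlap into a single position."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Continuous with the previous position: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def replace_spans(text: str, spans: List[Span], designated: str = "****") -> str:
    """Replace each merged position with the designated characters."""
    pieces, cursor = [], 0
    for start, end in merge_spans(spans):
        pieces.append(text[cursor:start])
        pieces.append(designated)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

With this representation, a name at (10, 13) directly followed by a mobile phone number at (13, 24) is merged into the single position (10, 24) and replaced once.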
In another aspect, the present application further provides an apparatus for user data desensitization, comprising:
the acquisition module is used for acquiring a data desensitization request; the data desensitization request comprises a text to be desensitized and a plurality of desensitization information categories;
the identification module is used for identifying the text to be desensitized according to the sensitive information funnel model corresponding to each desensitization information category to obtain the sensitive information position in the text to be desensitized;
and the replacing module is used for replacing the sensitive information with the designated character in the sensitive information position of the text to be desensitized to obtain the desensitized text.
Further, the present application also provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above method of user data desensitization.
Additionally, the present application also provides a computer readable storage medium storing a computer program executable by a processor to perform the above-described method of user data desensitization.
According to the scheme, after a data desensitization request is obtained, a corresponding sensitive information funnel model can be selected according to a plurality of desensitization information categories in the data desensitization request, so that sensitive information of various desensitization information categories is identified in a targeted manner according to the sensitive information funnel model, the position of the sensitive information in a text to be desensitized is accurately obtained, and the sensitive information is replaced by designated characters; by the above measures, accurate data desensitization can be achieved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic view of an application scenario of a method for desensitizing user data according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for desensitizing user data according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a sensitive information identification method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a sensitive information identification method according to another embodiment of the present application;
fig. 6 is a schematic flowchart of a sensitive information identification method according to another embodiment of the present application;
fig. 7 is a schematic flowchart of a sensitive information identification method according to another embodiment of the present application;
fig. 8 is a block diagram of an apparatus for desensitizing user data according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 is an application scenario diagram of a method for desensitizing user data according to an embodiment of the present application. As shown in fig. 1, the application scenario includes a client 40 and a server 50, where the server 50 may be a server, a server cluster, or a cloud computing center; the client 40 may be an intelligent device such as a tablet computer or a personal host, and is configured to initiate a data desensitization request to the server 50; the server 50 may perform desensitization processing on the text to be desensitized in response to the data desensitization request.
As shown in fig. 2, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12, one processor 11 being exemplified in fig. 2. The processor 11 and the memory 12 are connected by a bus 10, and the memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11, so that the electronic device 1 can execute all or part of the flow of the method in the embodiments described below. In an embodiment, the electronic device 1 may be the server 50.
The Memory 12 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk.
The present application also provides a computer readable storage medium having stored thereon a computer program executable by a processor 11 for performing the method of user data desensitization provided herein.
Fig. 3 is a schematic flowchart of a method for desensitizing user data according to an embodiment of the present application. As shown in fig. 3, the method may include the following steps 310 to 330.
Step 310: acquiring a data desensitization request; wherein the data desensitization request comprises a text to be desensitized and a plurality of desensitization information categories.
The client may initiate a data desensitization request to the server. The data desensitization request asks the server to perform data desensitization on the text to be desensitized.
The text to be desensitized is the text on which data desensitization needs to be performed, and it contains personal sensitive information of the user. For example, the text to be desensitized may be a user log of an internet hospital: an operator selects the user log through a client and initiates a data desensitization request to the server, and the server performs data desensitization on the user log.
The desensitization information category is a category into which sensitive information is pre-partitioned. In one embodiment, desensitization information categories may include identification number, bank card number, telephone number, cell phone number, email, home address, and name.
One or more desensitization information categories may be included in the data desensitization request, depending on the needs of the actual application scenario. The server can parse the text to be desensitized and the desensitization information categories from the data desensitization request.
Step 320: for each desensitization information category, identifying the text to be desensitized according to the sensitive information funnel model corresponding to that category, to obtain the sensitive information position in the text to be desensitized.
The sensitive information funnel model refers to a series of algorithms, applied in sequence, for identifying sensitive information. Because the sensitive information in each desensitization information category differs in form and content, the processing logic of each sensitive information funnel model also differs.
The server can determine the corresponding sensitive information funnel model for each parsed desensitization information category and then run the text to be desensitized through these funnel models one by one, thereby identifying the sensitive information corresponding to each category. At this point, the position of the sensitive information in the text to be desensitized can be determined. The sensitive information position indicates the absolute position of the sensitive information in the text to be desensitized.
Illustratively, the text to be desensitized is "The above data is wrong, the patient is Fei Guojun 171****1345, birth 3.2.1947, age 70". The sensitive information includes "Fei Guojun", corresponding to the desensitization information category "name", and "171****1345", corresponding to the category "mobile phone number" (the text to be desensitized contains the full mobile phone number; the middle 4 digits are masked here). The server recognizes "Fei Guojun" according to the sensitive information funnel model corresponding to "name" and can determine that the sensitive information starts at the 11th character and ends at the 13th character. The server recognizes "171****1345" according to the sensitive information funnel model corresponding to "mobile phone number" and can determine that the sensitive information starts at the 14th character and ends at the 24th character. These character positions refer to the original Chinese text, in which the name is three characters long.
Step 330: replacing the sensitive information with the designated characters at the sensitive information position of the text to be desensitized to obtain the desensitized text.
The designated characters mark where sensitive information has been replaced. Illustratively, the designated characters may be a fixed run of consecutive characters (e.g., 4 consecutive characters), or as many characters as the sensitive information being replaced. When the desensitized text is viewed later, the position of the sensitive information can be identified unambiguously from the designated characters.
After the server identifies the sensitive information in the text to be desensitized, the sensitive information can be replaced by the designated characters in the position of the sensitive information, so that sensitive information leakage is avoided, and data desensitization is realized.
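Steps 310 to 330 can be sketched in Python roughly as follows. This is a minimal illustration, not the application's implementation: the `FUNNELS` registry, the category key "mobile_number", the mobile-number pattern, and the use of "*" as the designated character are all assumptions. The sketch uses 0-based offsets with an exclusive end, whereas the example above counts characters from 1.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # assumed representation: 0-based start, exclusive end


def find_mobile_numbers(text: str) -> List[Span]:
    # Hypothetical recognizer: 11 digits starting with 1, not part of a longer digit run.
    pattern = r"(?<![0-9])1[3-9][0-9]{9}(?![0-9])"
    return [m.span() for m in re.finditer(pattern, text)]


# Hypothetical registry: desensitization information category -> funnel model.
FUNNELS: Dict[str, Callable[[str], List[Span]]] = {
    "mobile_number": find_mobile_numbers,
}


def desensitize(
    text: str, categories: List[str], mask_char: str = "*"
) -> Tuple[str, List[Tuple[str, Span]]]:
    """Return the desensitized text and the (category, position) pairs found."""
    found: List[Tuple[str, Span]] = []
    for category in categories:  # step 320: one funnel model per category
        for span in FUNNELS[category](text):
            found.append((category, span))
    chars = list(text)
    for _, (start, end) in found:  # step 330: overwrite each position
        chars[start:end] = [mask_char] * (end - start)
    return "".join(chars), found
```

Here each sensitive character is replaced by one designated character, which corresponds to the "same number of characters" option described above; the returned positions could also be sent back to the requester.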
In an embodiment, the desensitization information category is an identification number or a bank card number. Fig. 4 is a schematic flowchart of a sensitive information identification method provided in an embodiment of the present application. As shown in fig. 4, when the server identifies the text to be desensitized according to the sensitive information funnel model, the following steps 410 to 430 may be performed.
Step 410: judging whether suspected sensitive information exists in the text to be desensitized according to the regular expression corresponding to the desensitization information category.
Regular expressions are used to retrieve text that conforms to a given pattern. The regular expression corresponding to the identification number differs from the regular expression corresponding to the bank card number.
The server can select the corresponding regular expression according to the parsed desensitization information category ("identification number" or "bank card number") and then judge whether suspected sensitive information matching the regular expression exists in the text to be desensitized. Suspected sensitive information is information that matches the regular expression; it must be verified further to determine whether it really is sensitive information. Illustratively, using the regular expression corresponding to the identification number, the server determines that a string of 18 consecutive digits exists in the text to be desensitized; this string is suspected sensitive information.
Step 420: if so, verifying the suspected sensitive information according to the verification rule corresponding to the desensitization information category.
Step 430: when the suspected sensitive information is verified to be sensitive information corresponding to the desensitization information category, acquiring the sensitive information position of the sensitive information in the text to be desensitized.
When suspected sensitive information corresponding to the identification number exists in the text to be desensitized, the server can verify it according to the verification rule corresponding to the identification number. Taking an 18-digit identification number as an example: the first 6 digits are the address code, which is the administrative region code of the place where the user's household is registered; digits 7 to 14 are the date of birth (year, month, and day); digits 15 to 17 are the sequence code, a serial number assigned to people born on the same date within the area identified by the same address code, in which odd numbers are assigned to males and even numbers to females, that is, an odd 17th digit indicates male and an even one indicates female; and the 18th digit is the check code. The server can judge whether the address code, the date of birth, and the sequence code each conform to the coding rules, then compute a check value from the first 17 digits and judge whether the result matches the check code in the 18th position. If they match, the suspected sensitive information is determined to be an identification number and therefore sensitive information. In this case, the server can obtain the position of the sensitive information in the text to be desensitized.
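The regular-expression-then-verification funnel for identification numbers might look like the following Python sketch. The application does not give the check computation; the weights and check characters below follow the public GB 11643-1999 rule (ISO 7064 MOD 11-2), in which the check character may be "X". The address-code lookup is omitted because it needs a region code table, and all function and constant names are assumptions.

```python
import re
from datetime import datetime
from typing import List, Tuple

# Candidate: 17 digits plus a digit or X, not embedded in a longer digit run.
ID_PATTERN = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])")
# Weights and check characters per GB 11643-1999 (not stated in the application).
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CHARS = "10X98765432"


def is_valid_id_number(candidate: str) -> bool:
    """Verify the birth date (digits 7-14) and the check code (18th position)."""
    try:
        datetime.strptime(candidate[6:14], "%Y%m%d")
    except ValueError:
        return False
    total = sum(int(digit) * weight for digit, weight in zip(candidate[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == candidate[17].upper()


def find_id_numbers(text: str) -> List[Tuple[int, int]]:
    """Steps 410-430: regex match, verify, then return (start, end) positions."""
    return [m.span() for m in ID_PATTERN.finditer(text) if is_valid_id_number(m.group())]
```

A candidate that matches the pattern but fails either check is discarded, which is the narrowing that gives the funnel model its name.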
When it is determined that suspected sensitive information corresponding to the bank card number exists in the text to be desensitized, the server can verify it according to the verification rule corresponding to the bank card number. The server can search a preset BIN (bank identification number) code library using the first 6 digits of the suspected sensitive information and judge whether a matching BIN code exists. The BIN codes of multiple bank cards from multiple banks are entered into the BIN code library in advance. If no corresponding BIN code is found, it can be determined that the suspected sensitive information is not a bank card number. If the corresponding BIN code is found, the server can further judge whether the string length of the suspected sensitive information equals the bank card number length associated with that BIN code. When the two are equal, the suspected sensitive information is determined to be a bank card number and therefore sensitive information. In this case, the server can obtain the position of the sensitive information in the text to be desensitized.
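The bank card check can be sketched in the same way. The BIN library below is a placeholder with made-up entries; a real library would be loaded from issuer data. The candidate pattern (13 to 19 digits) and all names are likewise assumptions.

```python
import re
from typing import Dict, List, Tuple

# Placeholder BIN library (made-up entries): 6-digit BIN -> expected card number length.
BIN_LIBRARY: Dict[str, int] = {"999999": 16, "888888": 19}

# Candidate: a run of 13 to 19 digits, not embedded in a longer digit run (assumed pattern).
CARD_PATTERN = re.compile(r"(?<![0-9])[0-9]{13,19}(?![0-9])")


def is_bank_card(candidate: str) -> bool:
    """Look up the first 6 digits, then compare the candidate length."""
    expected_length = BIN_LIBRARY.get(candidate[:6])
    return expected_length is not None and len(candidate) == expected_length


def find_bank_cards(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) positions of verified bank card numbers."""
    return [m.span() for m in CARD_PATTERN.finditer(text) if is_bank_card(m.group())]
```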
In an embodiment, the desensitization information category is a telephone number, a mobile phone number, or an email. Fig. 5 is a schematic flowchart of a sensitive information identification method provided in another embodiment of the present application. When the server identifies the text to be desensitized according to the sensitive information funnel model, the following steps 510 to 520 may be performed.
Step 510: judging whether sensitive information exists in the text to be desensitized according to the regular expression corresponding to the desensitization information category.
The server can select the corresponding regular expression according to the parsed desensitization information category ("telephone number", "mobile phone number", or "email") and then judge whether sensitive information matching the regular expression exists in the text to be desensitized. Because the sensitive information in these categories has a relatively simple form, any text matching the regular expression can be identified directly as the corresponding sensitive information.
Step 520: if so, acquiring the position of the sensitive information in the text to be desensitized.
When the server identifies sensitive information corresponding to any of these desensitization information categories, it can obtain the position of that sensitive information in the text to be desensitized.
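For these regex-only categories, a sketch could keep one compiled pattern per category. The patterns below are deliberately simple assumptions (for example, landline numbers written as area code, hyphen, subscriber number); the application does not disclose its actual expressions.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Assumed, simplified patterns; real deployments would need broader coverage.
PATTERNS: Dict[str, Pattern[str]] = {
    "telephone_number": re.compile(r"(?<![0-9])0[0-9]{2,3}-[0-9]{7,8}(?![0-9])"),
    "mobile_number": re.compile(r"(?<![0-9])1[3-9][0-9]{9}(?![0-9])"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
}


def find_by_regex(text: str, category: str) -> List[Tuple[int, int]]:
    """Steps 510-520: every regex match is taken as sensitive information."""
    return [m.span() for m in PATTERNS[category].finditer(text)]
```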
In an embodiment, the desensitization information category is a home address. Fig. 6 is a schematic flowchart of a sensitive information identification method provided in another embodiment of the present application. When the server identifies the text to be desensitized according to the sensitive information funnel model, the following steps 610 to 650 may be performed.
Step 610: identifying address information of each level in the text to be desensitized through the trained place matching model.
The place matching model can be obtained by training an Aho-Corasick (AC) automaton model. Before executing the method for desensitizing user data, the server can train the place matching model. For example, the server may obtain place training data and cleanse it. The place training data includes address information of each level, such as provincial name, city name, county name, town name, road name, cell name, business name, and city name. The server can then expand the cleansed place training data; for example, synonyms such as "Beijing" can be derived from "Beijing City". Using the expanded place training data as a corpus, the server can train the AC automaton model and thereby obtain the place matching model.
The server side can identify address information of each level in the to-be-desensitized text according to the location matching model, so that address information of multiple levels can be obtained.
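A compact Aho-Corasick automaton can illustrate how such a place matching model finds every dictionary entry in one pass over the text. In this sketch, "training" simply means building the automaton from a dictionary that maps place names to address levels; the class name `PlaceMatcher` and the example entries are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple


class PlaceMatcher:
    """Aho-Corasick automaton over a dictionary {place name: address level}."""

    def __init__(self, places: Dict[str, str]) -> None:
        self.levels = places
        self.children: List[Dict[str, int]] = [{}]  # trie transitions per node
        self.fail: List[int] = [0]                  # failure links
        self.output: List[List[str]] = [[]]         # names ending at each node
        for name in places:                         # build the trie
            node = 0
            for ch in name:
                if ch not in self.children[node]:
                    self.children.append({})
                    self.fail.append(0)
                    self.output.append([])
                    self.children[node][ch] = len(self.children) - 1
                node = self.children[node][ch]
            self.output[node].append(name)
        queue = deque(self.children[0].values())    # breadth-first failure links
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                fallback = self.fail[node]
                while fallback and ch not in self.children[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.children[fallback].get(ch, 0)
                self.output[child] = self.output[child] + self.output[self.fail[child]]
                queue.append(child)

    def find(self, text: str) -> List[Tuple[int, int, str]]:
        """Return (start, end, level) for every place name found; end is exclusive."""
        hits: List[Tuple[int, int, str]] = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for name in self.output[node]:
                hits.append((i - len(name) + 1, i + 1, self.levels[name]))
        return hits
```

For example, `PlaceMatcher({"Liaoning Province": "province", "Anshan City": "city"}).find("Liaoning Province Anshan City")` yields one hit per level, each with its position.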
Step 620: determining continuous address information combinations in the text to be desensitized according to the identified sensitive information position of each piece of address information.
The server can determine which sensitive information positions are continuous and treat the pieces of address information at those continuous positions as one address information combination. An address information combination comprises at least two pieces of address information. Illustratively, the text to be desensitized comprises a complete home address, "26 units 101 in east region of Naohuayuan (Olympic park, Mingshan city, Liaoning province)"; the server identifies several pieces of address information through the place matching model, such as "Liaoning province", "Anshan city", "Naohuayuan" and "Minyayuan", and, because their sensitive information positions are continuous, can determine that they form one address information combination.
Step 630: for each address information combination, calculating the total score of the combination according to the preset weight and the preset score of each level of address information in the combination.
The server side can form a plurality of address information combinations according to the identified address information. For each address information combination, the server side can perform weighted summation according to the preset weight and the preset score of each level of address information to obtain a total score. Wherein, the weight value corresponding to each level of address information can be determined according to the size of the corresponding range.
Illustratively, the weight of province-level address information is the lowest, the weight of city-level address information is higher, the weight of district-level address information is higher still, and so on, with cell-level (residential community) address information having the highest weight. The score of each level of address information can likewise be determined by the size of its range: the score of province-level address information is the lowest, the score of city-level address information is higher, the score of district-level address information is higher still, and so on, with cell-level address information having the highest score.
For some address information that is not likely to be a home address, such as a shop name, a hospital name, etc., a smaller weight and score may be configured.
Step 640: judging whether the total score of the address information combination is greater than a preset total score threshold.
Step 650: if so, determining that the address information combination is sensitive information, and acquiring the sensitive information position of the address information combination.
After the server calculates the total score for the address information combination, it can judge whether the total score is greater than the total score threshold. The total score threshold is used to distinguish real home addresses from other address information combinations.
In one aspect, if the total score is not greater than the total score threshold, the server may determine that the address information combination is not a home address. On the other hand, if the total score is greater than the total score threshold, the server may determine that the address information combination is a home address, in other words, the address information combination is sensitive information. In this case, the server may obtain a sensitive information location of the address information combination, where the sensitive information location includes a start location of the first address information in the address information combination and an end location of the last address information in the address information combination.
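Steps 640 and 650 can be sketched as a single function. The `AddressPart` type, the weight and score tables, and the threshold value are hypothetical; positions are assumed to be 1-based and inclusive, and the parts of a combination are assumed to be listed in text order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AddressPart:
    level: str   # e.g. "province", "city", "district", "cell"
    start: int   # 1-based position of the first character
    end: int     # 1-based position of the last character (inclusive)


# Hypothetical presets; the source does not give concrete numbers.
WEIGHT: Dict[str, float] = {"province": 1.0, "city": 2.0, "district": 3.0, "cell": 4.0}
SCORE: Dict[str, float] = {"province": 1.0, "city": 2.0, "district": 3.0, "cell": 4.0}
TOTAL_SCORE_THRESHOLD = 10.0


def locate_home_address(
    parts: List[AddressPart], threshold: float = TOTAL_SCORE_THRESHOLD
) -> Optional[Tuple[int, int]]:
    """Return (start, end) if the combination is judged a home address, else None."""
    total = sum(WEIGHT[p.level] * SCORE[p.level] for p in parts)
    if total <= threshold:
        return None  # not greater than the threshold: not sensitive
    # Start of the first address part, end of the last one.
    return (parts[0].start, parts[-1].end)


combo = [
    AddressPart("province", 5, 7),
    AddressPart("city", 8, 10),
    AddressPart("district", 11, 13),
    AddressPart("cell", 14, 18),
]
span = locate_home_address(combo)
```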
Before the method for desensitizing user data is executed, the server may acquire address sample data labeled with positive or negative marks. Each address sample contains multi-level address information forming an address information combination. A positive mark indicates that the address sample is a home address, and a negative mark indicates that it is not.
The server may calculate a total score for each address sample and judge whether the sample is a home address by comparing the total score with a candidate total score threshold. From the judgment results and the positive and negative marks of the samples, the server may calculate the overall precision and recall. The server may then increase or decrease the candidate total score threshold, repeat the judgment process, and calculate a new precision and recall. Among the candidate thresholds whose precision meets the preset precision requirement, the one with the highest recall is selected as the actual total score threshold.
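The threshold selection procedure can be sketched as a simple search over candidate thresholds. The function names, the sample scores, and the candidate list are illustrative assumptions; treating precision as 0 when nothing is predicted positive is also an assumption, since the source does not cover that case.

```python
from typing import List, Optional, Sequence, Tuple


def precision_recall(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when 'score > threshold' is judged a home address."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def choose_threshold(
    scores: Sequence[float],
    labels: Sequence[bool],
    candidates: Sequence[float],
    min_precision: float,
) -> Optional[float]:
    """Among candidates meeting the precision requirement, pick the highest recall."""
    best: Optional[float] = None
    best_recall = -1.0
    for t in candidates:
        p, r = precision_recall(scores, labels, t)
        if p >= min_precision and r > best_recall:
            best, best_recall = t, r
    return best


# Hypothetical labeled samples: total score and positive/negative mark.
sample_scores: List[float] = [30, 14, 9, 5, 12, 3]
sample_labels: List[bool] = [True, True, True, False, False, False]
chosen = choose_threshold(sample_scores, sample_labels, [4, 10, 13, 20], 0.9)
```

In this toy data set the thresholds 13 and 20 both reach the required precision, and 13 is chosen because it recalls more of the true home addresses.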
In an embodiment, the desensitization information category is name. Referring to fig. 7, which is a flowchart of a sensitive information identification method provided in another embodiment of the present application, when the server identifies the text to be desensitized according to the sensitive information funnel model, it may perform the following steps 710 to 730.
Step 710: identify the text to be desensitized through the trained name matching model to obtain suspected sensitive information in the text to be desensitized.
The name matching model can be obtained by training an AC (Aho-Corasick) automaton model. Before executing the method for desensitizing user data, the server may train the name matching model. For example, the server may obtain a third-party name library as a corpus and train the AC automaton model on it to obtain the name matching model.
The server may identify names in the text to be desensitized according to the name matching model, thereby obtaining suspected sensitive information. Here, suspected sensitive information is information that may be a name.
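A minimal Aho-Corasick automaton is sketched below to show how a name library can be matched against a text in a single pass. It assumes that "training" amounts to building the automaton from the names in the library; the `NameMatcher` class, its methods, and the sample names are hypothetical, and positions are 0-based and inclusive.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


class NameMatcher:
    """Minimal Aho-Corasick automaton built from a list of non-empty names."""

    def __init__(self, names: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie transitions per state
        self.fail: List[int] = [0]              # failure link per state
        self.out: List[List[str]] = [[]]        # names ending at each state
        for name in names:
            self._add(name)
        self._build_failure_links()

    def _add(self, word: str) -> None:
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self) -> None:
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit names that end at the failure state (proper suffixes).
                self.out[nxt] += self.out[self.fail[nxt]]

    def find(self, text: str) -> List[Tuple[int, int, str]]:
        """Return (start, end, name) for every occurrence of a library name."""
        hits: List[Tuple[int, int, str]] = []
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for name in self.out[state]:
                hits.append((i - len(name) + 1, i, name))
        return hits


matcher = NameMatcher(["Wang Lei", "Li", "Liu Lei"])
suspected = matcher.find("Patient Liu Lei saw Li")
```

Every hit is only suspected sensitive information at this stage; overlapping matches such as "Li" inside "Liu Lei" are reported as well and are left to the later filtering steps.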
Step 720: filter the suspected sensitive information through the specified domain dictionary and a preset suffix filtering rule to obtain the sensitive information.
Step 730: acquire the sensitive information position of the sensitive information.
In a specific domain (e.g., the medical domain), some proper nouns are formed in a way similar to names. To avoid treating proper nouns of a specific domain as names, the server may filter the suspected sensitive information through the specified domain dictionary. The specified domain dictionary is a dedicated dictionary of a specific domain and contains a large number of proper nouns of that domain. Illustratively, the specified domain dictionary may be a filtering dictionary for the medical domain, composed of terms for diseases, departments, hospitals, symptoms, body parts, commodities, samples, signs, and locations, together with their bigram and trigram sets. The server may filter out suspected sensitive information that matches a word in the specified domain dictionary.
The server may further filter the remaining suspected sensitive information according to the suffix filtering rule. Text often contains forms of address such as "Professor Wang" or "Doctor Li" (in Chinese the title follows the surname), which include a surname but no specific given name. Suspected sensitive words with suffixes such as "doctor" or "professor" may be filtered out by the suffix filtering rule.
The server may identify the suspected sensitive information remaining after the two rounds of filtering as names, and may then obtain the sensitive information position of that information.
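The two filtering rounds can be sketched as follows. The dictionary contents, the suffix list, and the `filter_candidates` helper are illustrative assumptions; the sample strings are kept in Chinese because the suffix rule depends on the title following the surname.

```python
from typing import List, Set, Tuple

Candidate = Tuple[int, int, str]  # (start, end, matched text)

# Hypothetical medical-domain dictionary: 白细胞 = "white blood cell",
# a term that begins with a common surname character.
DOMAIN_TERMS: Set[str] = {"白细胞"}
# Hypothetical title suffixes: 教授 = "professor", 医生 = "doctor".
TITLE_SUFFIXES: Tuple[str, ...] = ("教授", "医生")


def filter_candidates(
    candidates: List[Candidate],
    domain_terms: Set[str] = DOMAIN_TERMS,
    suffixes: Tuple[str, ...] = TITLE_SUFFIXES,
) -> List[Candidate]:
    """Drop candidates that are domain terms or end with a title suffix."""
    kept: List[Candidate] = []
    for start, end, text in candidates:
        if text in domain_terms:
            continue  # first round: specified domain dictionary
        if text.endswith(suffixes):
            continue  # second round: suffix filtering rule
        kept.append((start, end, text))
    return kept


suspected = [(0, 2, "白细胞"), (5, 7, "王教授"), (10, 11, "张伟")]
names = filter_candidates(suspected)  # only the real name 张伟 should remain
```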
In an embodiment, after obtaining the sensitive information position in the text to be desensitized, the server may return the sensitive information position and the desensitization information category corresponding to the sensitive information to the source of the data desensitization request. In this way, the client that sent the data desensitization request can learn the position and the desensitization information category of the sensitive information that actually exists in the text to be desensitized.
Illustratively, the data desensitization request initiated by the client includes three desensitization information categories: "identification card number", "bank card number", and "name". If, after identification, the server finds only sensitive information corresponding to the identification card number and the name, the server may return "identification card number" and "name", together with the sensitive information position corresponding to each of these desensitization information categories, to the client.
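One possible shape for the returned data is a mapping from category to positions that omits categories with no hits. The category keys, the `build_response` helper, and the sample positions are assumptions made for illustration; the source does not specify a response format.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start position, end position)


def build_response(
    requested: List[str], found: Dict[str, List[Span]]
) -> Dict[str, List[Span]]:
    """Keep only the requested categories for which sensitive information was found."""
    return {cat: found[cat] for cat in requested if found.get(cat)}


requested = ["id_card_number", "bank_card_number", "name"]
found = {"id_card_number": [(5, 22)], "bank_card_number": [], "name": [(1, 3)]}
response = build_response(requested, found)
```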
In an embodiment, when the server replaces the sensitive information with the designated character, it may determine whether different sensitive information positions in the text to be desensitized are consecutive. If the sensitive information positions are not consecutive, the server may directly replace the sensitive information with the designated character. If at least two sensitive information positions are consecutive, the server may merge the consecutive positions into one sensitive information position.
Illustratively, the text to be desensitized is "data wrong above, patient is fiscal jun 171 x 1345, birth 3.2.1947, age 70". The server may merge the sensitive information position of "fiscal jun" (start position: the 11th character; end position: the 13th character) and the sensitive information position of "171 x 1345" (start position: the 14th character; end position: the 24th character) into a single sensitive information position whose start position is the 11th character and whose end position is the 24th character.
The server may then replace the sensitive information with the designated character at the merged sensitive information position. This avoids an overly long run of designated characters in the desensitized text when consecutive sensitive information is long, which would make the desensitized text hard to read.
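The merge-then-replace behavior can be sketched as below. Positions are 1-based and inclusive, as in the example above. Using one fixed-length mask per merged position, and also merging overlapping positions, are assumptions that go beyond what the source states; `merge_adjacent` and `desensitize` are hypothetical names.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 1-based start and end positions, both inclusive


def merge_adjacent(spans: List[Span]) -> List[Span]:
    """Merge positions that directly follow (or overlap) one another."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def desensitize(text: str, spans: List[Span], mask: str = "***") -> str:
    """Replace each merged position with one fixed mask (an assumed policy)."""
    pieces: List[str] = []
    cursor = 0  # 0-based index of the next character not yet copied
    for start, end in merge_adjacent(spans):
        pieces.append(text[cursor:start - 1])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


merged = merge_adjacent([(11, 13), (14, 24)])
masked = desensitize("abcdefghij", [(2, 3), (4, 6), (9, 9)])
```

Here the positions 11 to 13 and 14 to 24 collapse into a single position 11 to 24, so the desensitized text carries one mask for the name and number together instead of two back-to-back runs.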
Referring to fig. 8, which is a block diagram of an apparatus for desensitizing user data provided in an embodiment of the present application, the apparatus may include:
an obtaining module 810, configured to obtain a data desensitization request; the data desensitization request comprises a text to be desensitized and a plurality of desensitization information categories;
the identification module 820 is used for identifying the text to be desensitized according to the sensitive information funnel model corresponding to each desensitization information category to obtain the sensitive information position in the text to be desensitized;
a replacing module 830, configured to replace the sensitive information with a specified character in the sensitive information position of the text to be desensitized, to obtain a desensitized text.
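The division into three modules can be sketched as three small functions wired together. Only regular-expression recognizers are shown, standing in for the per-category funnel models; the regular expressions, the category keys, and all function names are illustrative assumptions rather than the patterns used by the applicant.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # 0-based start, exclusive end


@dataclass
class DesensitizeRequest:  # what the obtaining module receives
    text: str
    categories: List[str]


def _regex_recognizer(pattern: str) -> Callable[[str], List[Span]]:
    compiled = re.compile(pattern)
    return lambda text: [m.span() for m in compiled.finditer(text)]


# Hypothetical recognizers, one per desensitization information category.
RECOGNIZERS: Dict[str, Callable[[str], List[Span]]] = {
    "mobile": _regex_recognizer(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "email": _regex_recognizer(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def identify(request: DesensitizeRequest) -> List[Span]:
    """Identification module: collect positions for every requested category."""
    spans: List[Span] = []
    for category in request.categories:
        spans.extend(RECOGNIZERS[category](request.text))
    return sorted(spans)


def replace(text: str, spans: List[Span], mask: str = "***") -> str:
    """Replacing module: substitute the designated characters at each position."""
    pieces: List[str] = []
    cursor = 0
    for start, end in spans:
        if start < cursor:
            continue  # skip a position overlapping one already replaced
        pieces.append(text[cursor:start])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def handle(request: DesensitizeRequest) -> str:
    return replace(request.text, identify(request))


result = handle(
    DesensitizeRequest("call 13812345678 or mail a.b@example.com", ["mobile", "email"])
)
```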
The implementation process of the functions and actions of each module in the apparatus is specifically detailed in the implementation process of the corresponding step in the user data desensitization method, and is not described herein again.
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (10)

1. A method of user data desensitization, comprising:
acquiring a data desensitization request; the data desensitization request comprises a text to be desensitized and a plurality of desensitization information categories;
for each desensitization information category, identifying the text to be desensitized according to a sensitive information funnel model corresponding to the desensitization information category to obtain a sensitive information position in the text to be desensitized;
and replacing sensitive information with designated characters in the sensitive information position of the text to be desensitized to obtain a desensitized text.
2. The method of claim 1, wherein the desensitization information category is an identification number or bank card number;
the identifying the text to be desensitized according to the sensitive information funnel model corresponding to the desensitization information category to obtain the sensitive information position in the text to be desensitized comprises the following steps:
judging whether the text to be desensitized has suspected sensitive information according to the regular expression corresponding to the desensitization information category;
if yes, verifying the suspected sensitive information according to a verification rule corresponding to the desensitization information category;
and when the suspected sensitive information is verified to be sensitive information corresponding to the desensitization information category, acquiring the sensitive information position of the sensitive information in the text to be desensitized.
3. The method of claim 1, wherein the desensitization information category is a telephone number, a cell phone number, or an email;
the identifying the text to be desensitized according to the sensitive information funnel model corresponding to the desensitization information category to obtain the sensitive information position in the text to be desensitized comprises the following steps:
judging whether sensitive information exists in the text to be desensitized according to the regular expression corresponding to the desensitization information category;
and if so, acquiring the position of the sensitive information in the text to be desensitized.
4. The method of claim 1, wherein the desensitization information category is home address;
the identifying the text to be desensitized according to the sensitive information funnel model corresponding to the desensitization information category to obtain the sensitive information position in the text to be desensitized comprises the following steps:
identifying each level of address information in the text to be desensitized through a trained place matching model;
according to the identified sensitive information position of each address information, determining continuous address information combinations in the text to be desensitized;
for each address information combination, calculating the total score of the address information combination according to the preset weight and the preset score of each level of address information in the address information combination;
judging whether the total score of the address information combination is greater than a preset total score threshold value or not;
if yes, determining that the address information combination is sensitive information, and acquiring the sensitive information position of the address information combination.
5. The method of claim 1, wherein the desensitization information category is name;
the identifying the text to be desensitized according to the sensitive information funnel model corresponding to the desensitization information category to obtain the sensitive information position in the text to be desensitized comprises the following steps:
identifying the text to be desensitized through a trained name matching model to obtain suspected sensitive information in the text to be desensitized;
filtering the suspected sensitive information through a designated domain dictionary and a preset suffix filtering rule to obtain sensitive information;
and acquiring the sensitive information position of the sensitive information.
6. The method of claim 1, wherein after obtaining the location of sensitive information in the text to be desensitized, the method further comprises:
and returning the sensitive information position and the desensitization information category corresponding to the sensitive information to the source of the data desensitization request.
7. The method of claim 1, wherein the replacing sensitive information with specified characters in the sensitive information location of the text to be desensitized comprises:
judging whether the different sensitive information positions in the text to be desensitized are continuous or not;
if at least two sensitive information positions are continuous, combining the continuous sensitive information positions into one sensitive information position;
and replacing the sensitive information with the specified character at the position of the combined sensitive information.
8. An apparatus for user data desensitization, comprising:
the acquisition module is used for acquiring a data desensitization request; the data desensitization request comprises a text to be desensitized and a plurality of desensitization information categories;
the identification module is used for identifying the text to be desensitized according to the sensitive information funnel model corresponding to each desensitization information category to obtain the sensitive information position in the text to be desensitized;
and the replacing module is used for replacing the sensitive information with the designated character in the sensitive information position of the text to be desensitized to obtain the desensitized text.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of user data desensitization of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of user data desensitization according to any of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011206902.2A CN112329055A (en) 2020-11-02 2020-11-02 Method and device for desensitizing user data, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112329055A true CN112329055A (en) 2021-02-05

Family

ID=74323145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011206902.2A Pending CN112329055A (en) 2020-11-02 2020-11-02 Method and device for desensitizing user data, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112329055A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809157A (en) * 2015-03-25 2015-07-29 小米科技有限责任公司 Number recognition method and device
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN108268785A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of sensitive data identification and the device and method of desensitization
CN109344370A (en) * 2018-08-23 2019-02-15 阿里巴巴集团控股有限公司 Sensitive content desensitization, restoring method, device and equipment
CN109344258A (en) * 2018-11-28 2019-02-15 中国电子科技网络信息安全有限公司 A kind of intelligent self-adaptive sensitive data identifying system and method
CN109726585A (en) * 2018-12-14 2019-05-07 银江股份有限公司 A kind of integrated data desensitization system and method towards ID card No.
CN109918548A (en) * 2019-04-08 2019-06-21 上海凡响网络科技有限公司 A kind of methods and applications of automatic detection document sensitive information
CN110313147A (en) * 2017-03-09 2019-10-08 西门子公司 Data processing method, device and system
CN110489990A (en) * 2018-05-15 2019-11-22 ***通信集团浙江有限公司 A kind of sensitive data processing method, device, electronic equipment and storage medium
CN111274610A (en) * 2020-01-21 2020-06-12 京东数字科技控股有限公司 Data desensitization method and device and desensitization service platform
CN111832062A (en) * 2019-04-19 2020-10-27 珠海金山办公软件有限公司 Method and device for desensitizing selected area data in table file

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
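Liu Yu (刘宇): "Address standardization algorithm based on AC automaton and address probability model" (基于AC自动机和地址概率模型的地址标准化算法), Computer and Modernization (计算机与现代化), pages 188 - 189 *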

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268768A (en) * 2021-05-24 2021-08-17 平安普惠企业管理有限公司 Desensitization method, apparatus, device and medium for sensitive data
CN113268768B (en) * 2021-05-24 2024-04-16 重庆颂车网络科技有限公司 Desensitization method, device, equipment and medium for sensitive data
CN113569293A (en) * 2021-08-12 2021-10-29 明品云(北京)数据科技有限公司 Similar user acquisition method, system, electronic device and medium
CN113569293B (en) * 2021-08-12 2024-06-07 明品云(北京)数据科技有限公司 Similar user acquisition method, system, electronic equipment and medium
CN114116644A (en) * 2021-11-26 2022-03-01 北京字节跳动网络技术有限公司 Log file processing method, device, equipment and storage medium
CN114116644B (en) * 2021-11-26 2024-01-30 抖音视界有限公司 Log file processing method, device, equipment and storage medium
CN114239016A (en) * 2021-12-15 2022-03-25 阳光财产保险股份有限公司 Data security processing method, system and storage medium
CN115002508A (en) * 2022-06-07 2022-09-02 中国工商银行股份有限公司 Live data stream method and device, computer equipment and storage medium
CN114943969A (en) * 2022-06-16 2022-08-26 平安普惠企业管理有限公司 Method, device, equipment and storage medium for intelligently identifying and desensitizing sensitive information
CN116842560A (en) * 2023-06-19 2023-10-03 北京泰镝科技股份有限公司 Sensitive information desensitization display method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination