CN113283232A

CN113283232A - Method and device for automatically analyzing private information in text

Info

Publication number: CN113283232A
Application number: CN202110601345.2A
Authority: CN
Inventors: 鲍梦瑶; 刘佳伟; 章鹏; 刘新源; 张谦; 贾茜
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2021-05-31
Filing date: 2021-05-31
Publication date: 2021-08-20

Abstract

Embodiments of this specification provide a method and a device for automatically parsing privacy information in a text. The method comprises: acquiring a text to be parsed; performing word segmentation on the text to be parsed to obtain a word sequence containing a plurality of words; encoding the word sequence based on context to obtain word vectors corresponding to the respective words; determining, according to the word vectors, the probabilities that the corresponding words belong to each of a plurality of privacy information categories; determining the privacy information category with the maximum probability as the attribution category of the corresponding word; and determining the parsing result of the text to be parsed according to the attribution categories of the words and the positions of the words in the word sequence. This can improve the text parsing effect.

Description

Method and device for automatically analyzing private information in text

Technical Field

One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for automatically parsing private information in text.

Background

Private data, or secret data, refers to information that its owner does not want others or unrelated persons to know. From the perspective of the privacy owner, it can be divided into individual private data and common private data. Individual private data includes information that can be used to locate or identify an individual (such as phone numbers, addresses, and credit card numbers) and sensitive information (such as personal health, financial information, and critical company documents). Common private data mainly concerns family privacy, such as a family's annual income. The disclosure and abuse of private data is highly likely to cause various personal and public security problems. Preventing such disclosure and misuse often involves automatically parsing private information in text.

In the prior art, a named entity model is constructed using pre-labeled data, and the model is then used to extract privacy information from text. Because this approach requires a large amount of labeled data, labeling is difficult, model computation is complex, and the text parsing effect is poor.

Therefore, an improved scheme for improving the text parsing effect is desired.

Disclosure of Invention

One or more embodiments of the present specification describe a method and an apparatus for automatically parsing private information in a text, which can improve the parsing effect of the text.

In a first aspect, a method for automatically resolving private information in a text is provided, and the method includes:

acquiring a text to be analyzed;

performing word segmentation processing on the text to be analyzed to obtain a word sequence containing a plurality of words;

coding the word sequence based on context to obtain word vectors corresponding to the words respectively;

determining probabilities that the corresponding words respectively belong to a plurality of privacy information categories according to the word vectors;

determining the privacy information category corresponding to the maximum probability in the probabilities as the attribution category of the corresponding word;

and determining the analysis result of the text to be analyzed according to the attribution type of the words and the positions of the words in the word sequence.

In a possible implementation manner, the performing word segmentation processing on the text to be parsed includes:

splitting the text to be analyzed into a plurality of sentences;

and taking any one of the sentences as a target sentence, inputting the target sentence into a transfer learning model, and performing word segmentation processing on the target sentence through the transfer learning model to obtain a word sequence comprising a plurality of words.

In one possible embodiment, the context-based encoding of the word sequence includes:

and inputting the word sequence into a coding layer of a deep learning model, and carrying out context-based coding on the word sequence through the coding layer to obtain word vectors corresponding to the words respectively.

Further, determining, according to the word vector, probabilities that the corresponding words respectively belong to a plurality of privacy information categories includes:

and inputting the word vector into a classification layer of the deep learning model, and outputting the probabilities that the corresponding words belong to a plurality of privacy information categories respectively through the classification layer.

In a possible implementation manner, the determining, according to the attribution type of a word and the position of the word in the word sequence, a parsing result of the text to be parsed includes:

checking whether a plurality of words at adjacent positions in the text to be analyzed are in the same attribution type or not according to the attribution type of the words and the positions of the words in the word sequence;

and combining a plurality of words at adjacent positions of the same attribution type to serve as a result unit, and determining the attribution type corresponding to the result unit and the position of the attribution type in the word sequence as the analysis result of the text to be analyzed.

In one possible implementation mode, the text to be analyzed is privacy statement text of an application program;

the plurality of privacy information categories include: a non-privacy category free of privacy information and privacy statement compliance information, and a number of privacy categories corresponding to a number of preset categories of privacy statement compliance information.

Further, the number of preset categories of privacy statement compliance information includes at least one of:

privacy information storage period, privacy information expiration handling method, privacy information storage region, complaint and feedback channels, basic information about the application operator, and contact details of the person responsible for privacy information protection.

In a possible implementation manner, the text to be analyzed is privacy statement text of an application program, and after determining the parsing result of the text to be parsed, the method further includes:

obtaining a code analysis result of the application program, wherein the code analysis result indicates a first category set formed by privacy information categories actually collected by the application program;

determining, according to the parsing result of the text to be parsed, a second category set formed by the privacy information categories that the privacy statement text declares as collected;

and when the first category set is consistent with the second category set and all included privacy information categories belong to the privacy information categories that laws and regulations allow the application program to collect, determining that the application program is compliant.

In a second aspect, an apparatus for automatically parsing private information in text is provided, the apparatus comprising:

the acquisition unit is used for acquiring a text to be analyzed;

the word segmentation unit is used for carrying out word segmentation processing on the text to be analyzed acquired by the acquisition unit to obtain a word sequence containing a plurality of words;

the coding unit is used for carrying out context-based coding on the word sequence obtained by the word segmentation unit to obtain word vectors corresponding to the words respectively;

the probability determining unit is used for determining the probabilities that the corresponding words belong to a plurality of privacy information categories respectively according to the word vectors obtained by the encoding unit;

the category determining unit is used for determining the privacy information category corresponding to the maximum probability in the probabilities obtained by the probability determining unit as the attribution category of the corresponding word;

and the result determining unit is used for determining the analysis result of the text to be analyzed according to the attribution type of the words obtained by the type determining unit and the positions of the words in the word sequence.

In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.

In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.

According to the method and the device provided by the embodiments of this specification, a text to be parsed is first obtained; word segmentation is then performed on the text to obtain a word sequence containing a plurality of words; the word sequence is then encoded based on context to obtain word vectors corresponding to the respective words; the probabilities that each word belongs to each of a plurality of privacy information categories are determined according to the word vectors; the privacy information category with the maximum probability is then determined as the attribution category of the corresponding word; and finally, the parsing result of the text is determined according to the attribution categories of the words and the positions of the words in the word sequence. As can be seen from the above, in the embodiments of this specification, the text to be parsed is segmented into words and the attribution category of each word is then determined. This yields not only the privacy information categories of the words contained in the text but also the positions at which words of each privacy information category appear in the text. Because these positions arise naturally from the word sequence produced by word segmentation, no manually labeled position data is needed to train the model, so the text parsing effect can be improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;

FIG. 2 illustrates a flow diagram of a method of automatically parsing private information in text, according to one embodiment;

FIG. 3 illustrates a parsing process diagram for a text example, according to one embodiment;

FIG. 4 illustrates a parsing process diagram for a text example, according to another embodiment;

FIG. 5 illustrates a multi-classification model structure diagram according to one embodiment;

FIG. 6 illustrates an overall architectural diagram of privacy compliance, according to one embodiment;

FIG. 7 shows a schematic block diagram of an apparatus for automatically parsing private information in text, according to one embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario relates to automatic parsing of privacy information in a text. The text to be parsed can be the privacy statement text of an application (App); by parsing it, the categories of privacy information that the privacy statement declares as collected, and the positions of the corresponding privacy information, can be obtained. Referring to fig. 1, the text to be parsed is generally a long text that includes a plurality of sentences, for example the sentences separated by periods in fig. 1. Each sentence may include private information and non-private information; to highlight the private information, non-private information is denoted by x in the figure. Parsing the privacy statement text shows that the privacy information it declares as collected includes privacy information 1, privacy information 2, privacy information 3, privacy information 4 and privacy information 5, where privacy information 1 belongs to privacy information category 1 and is at position 1 in the privacy statement text; privacy information 2 belongs to privacy information category 2 and is at position 2; privacy information 3 belongs to privacy information category 3 and is at position 3; privacy information 4 belongs to privacy information category 1 and is at position 4; and privacy information 5 belongs to privacy information category 2 and is at position 5.

It is understood that different pieces of privacy information may have the same privacy information category; for example, privacy information 1 and privacy information 4 both belong to privacy information category 1, and privacy information 2 and privacy information 5 both belong to privacy information category 2. The privacy information categories that the privacy statement text declares as collected therefore include privacy information category 1, privacy information category 2 and privacy information category 3. Whether the corresponding application is compliant may subsequently be determined according to the parsing result of the privacy statement text, where compliance includes the declared categories falling within the privacy information categories that laws and regulations allow the application to collect.

Privacy information is generally specific, whereas a privacy information category covers a broader range. Table one is a correspondence table between privacy information and privacy information categories provided in the embodiments of this specification.

Table one: corresponding relation table of privacy information and privacy information category

It should be noted that the text parsing method provided by the embodiments of this specification has a very wide range of application scenarios and can be applied to various text parsing scenarios. For example, in addition to automatically parsing privacy information in text, the method may also be applied to: automatically parsing color information in text to obtain the color categories included in the text and the positions of the color information, where the color information may include red, blue, and the like, and the color categories may include cool tones, warm tones, and the like; automatically parsing region information in text to obtain the region categories and the positions of the region information, where the region information may include Beijing, New York, and the like, and the region categories may include China, America, and the like; automatically parsing price information in text to obtain the price categories and the positions of the price information, where the price information may include 5 yuan, 100 yuan, and the like, and the price categories may include categories corresponding to respective price intervals; or automatically parsing commodity information in text to obtain the commodity categories and the positions of the commodity information, where the commodity information may include pencils, washing machines, and the like, and the commodity categories may include stationery, electric appliances, and the like. In the embodiments of this specification, the category of the information and the position at which the information appears in the text can be obtained in all of these text parsing scenarios. In addition, the text to be parsed is not limited to the privacy statement text of an application; for example, it may be a product description or the like.

Fig. 2 shows a flowchart of a method for automatically parsing private information in text according to an embodiment, which may be based on the implementation scenario shown in fig. 1. As shown in fig. 2, the method for automatically parsing the private information in the text in this embodiment includes the following steps: step 21, acquiring a text to be parsed; step 22, performing word segmentation on the text to be parsed to obtain a word sequence containing a plurality of words; step 23, performing context-based encoding on the word sequence to obtain word vectors corresponding to the respective words; step 24, determining, according to the word vectors, the probabilities that the corresponding words belong to each of a plurality of privacy information categories; step 25, determining the privacy information category with the maximum probability as the attribution category of the corresponding word; and step 26, determining the parsing result of the text to be parsed according to the attribution categories of the words and the positions of the words in the word sequence. Specific ways of executing these steps are described below.
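As a reading aid, steps 21 to 26 can be sketched as a small pipeline. This is a minimal sketch and not the patented implementation: the function name `parse_text` and the three toy stand-ins (`toy_segment`, `toy_encode`, `toy_classify`) are hypothetical, and the toy encoder is not actually context-based. It only shows how the steps hand data to one another.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def parse_text(
    text: str,
    segment: Callable[[str], List[str]],
    encode: Callable[[Sequence[str]], List[Vector]],
    classify: Callable[[Vector], List[float]],
    categories: Sequence[str],
) -> List[Tuple[int, str, str]]:
    """Return (position, word, attribution category) triples for one text."""
    words = segment(text)        # step 22: word sequence
    vectors = encode(words)      # step 23: one vector per word
    result = []
    for position, (word, vector) in enumerate(zip(words, vectors)):
        probs = classify(vector)                              # step 24
        best = max(range(len(probs)), key=probs.__getitem__)  # step 25
        result.append((position, word, categories[best]))     # step 26
    return result


# Hypothetical stand-ins for the segmentation, coding and classification parts.
def toy_segment(text: str) -> List[str]:
    return text.split()


def toy_encode(words: Sequence[str]) -> List[Vector]:
    return [[1.0 if w.lower() in {"phone", "number"} else 0.0] for w in words]


def toy_classify(vector: Vector) -> List[float]:
    return [1.0 - vector[0], vector[0]]  # [P(O), P(BI)]


parsed = parse_text(
    "we collect your phone number",  # step 21: text to be parsed
    toy_segment, toy_encode, toy_classify, ["O", "BI"],
)
```

Passing the three stages in as callables keeps the sketch independent of any particular model; in the embodiments they correspond to the transfer learning model, the coding layer and the classification layer.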

First, in step 21, a text to be parsed is obtained. It is understood that the text to be parsed is usually a long text, and includes a plurality of sentences, including the private information.

In one example, the text to be parsed is privacy statement text of an application.

When an application is released, it must be accompanied by a written privacy statement, namely the privacy statement text, which should list the various kinds of privacy information that the enterprise declares it does and does not collect, including but not limited to personal location information, personal biological information and the like.

Then, in step 22, the text to be analyzed is subjected to word segmentation processing to obtain a word sequence including a plurality of words. It will be appreciated that a sequence of words contains a number of words in a naturally occurring, sequential order, each word having a particular position in the sequence of words.

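In one example, the performing word segmentation processing on the text to be parsed includes:

splitting the text to be analyzed into a plurality of sentences;

and taking any one of the sentences as a target sentence, inputting the target sentence into a transfer learning model, and performing word segmentation processing on the target sentence through the transfer learning model to obtain a word sequence comprising a plurality of words.

Transfer learning is a research field within machine learning. It focuses on storing the models that solve existing problems and applying them to different but related problems.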

Given a sentence of n characters $\{t_1, \dots, t_n\}$, word segmentation yields a word sequence $\{w_1, \dots, w_m\}$ containing m words, where m is generally smaller than n; that is, word segmentation may group several characters into one word. For example, the sentence "when you register, log in and use the related service" becomes "when/you/register/,/log in/and/use/related/service/time" after word segmentation, where adjacent words are separated by "/". In the original (Chinese) sentence, the words for "register" and "log in" each comprise two characters, while the words for "when" and "you" each comprise one character.
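The sentence splitting and the naturally formed word positions can be illustrated with a short sketch. The two regular expressions are assumptions made for illustration only; in particular, the regex tokenizer is a crude stand-in for the transfer learning model that performs word segmentation in the embodiments.

```python
import re
from typing import List, Tuple

# Assumed rule: a sentence ends at Chinese or Western sentence punctuation.
SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]?")
# Stand-in tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text: str) -> List[str]:
    """Split a long text into sentences."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def segment(sentence: str) -> List[Tuple[int, str]]:
    """Return (position, word) pairs; positions come from the sequence order."""
    return list(enumerate(TOKEN.findall(sentence)))


text = "When you register, we collect your phone number. We keep it for six months."
sentences = split_sentences(text)
word_sequence = segment(sentences[0])
```

Here the number of words m in `word_sequence` is smaller than the number of characters n in the sentence, and each word's position is simply its index in the sequence, so no position labels are required.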

Next, in step 23, context-based encoding is performed on the word sequence to obtain word vectors corresponding to the words. It will be appreciated that the words correspond one-to-one to word vectors, and if a word sequence includes m words, the above encoding results in m vectors.

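In one example, the context-based encoding of the sequence of words includes:

inputting the word sequence into a coding layer of a deep learning model, and carrying out context-based coding on the word sequence through the coding layer to obtain word vectors corresponding to the words respectively.

The coding layer may be implemented based on a convolutional neural network (CNN) or a long short-term memory network (LSTM), and has good adaptability.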

If the word sequence containing m words is denoted as $\{w_1, \dots, w_m\}$, the word vectors obtained for the respective words are denoted as $\{h_{w_1}, \dots, h_{w_m}\}$.
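To make "context-based" concrete, the sketch below uses a deliberately simple stand-in: each word's vector is the average of hand-made base vectors over a small window of neighbouring words. This is an assumption for illustration, not the CNN or LSTM coding layer of the embodiments, and the names `base_vector` and `encode_with_context` are hypothetical. It shows the two properties stated in the text: m words yield m vectors, and the vector of a word depends on its neighbours.

```python
from typing import List, Sequence


def base_vector(word: str) -> List[float]:
    """Toy context-free features: word length and a small character checksum."""
    return [float(len(word)), float(sum(map(ord, word)) % 10)]


def encode_with_context(words: Sequence[str], window: int = 1) -> List[List[float]]:
    """Average base vectors over [i - window, i + window] for each word i."""
    base = [base_vector(w) for w in words]
    encoded = []
    for i in range(len(words)):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        neighbours = base[lo:hi]
        encoded.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return encoded


a = encode_with_context(["when", "you", "register"])
b = encode_with_context(["thank", "you", "."])
```

The word "you" receives different vectors in `a` and `b` because its neighbours differ, which is the behaviour a learned contextual encoder provides in a far richer form.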

In step 24, the probabilities that the corresponding words belong to the privacy information categories are determined according to the word vectors. It is understood that a plurality of privacy information categories, for example, privacy information category 1, privacy information category 2, privacy information category 3, are divided in advance, and by this step, the probability 1 that a word belongs to the privacy information category 1, the probability 2 that a word belongs to the privacy information category 2, and the probability 3 that a word belongs to the privacy information category 3 are determined, respectively.

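In one example, the determining, according to the word vector, probabilities that words corresponding to the word vector respectively belong to a plurality of privacy information categories includes:

inputting the word vector into a classification layer of the deep learning model, and outputting, through the classification layer, the probabilities that the corresponding word belongs to each of the plurality of privacy information categories.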

The probability that word $w_i$ belongs to privacy information category $c_q$ can be expressed as $p(c_q \mid w_i) = \mathrm{softmax}(W h_{w_i})$, where $h_{w_i}$ is the word vector corresponding to word $w_i$, $W$ is the fully connected matrix, and softmax is the normalized exponential function, which constrains each element of a k-dimensional vector to the range (0,1) and makes the elements of the vector sum to 1.
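The classification layer formula can be checked numerically with a few lines of plain Python. The matrix `W` and the vector `h_wi` below are made-up values chosen only for illustration; the max-subtraction inside `softmax` is a standard numerical-stability step and is not something the source specifies.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalized exponential: outputs lie in (0, 1) and sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W: Sequence[Sequence[float]], h: Sequence[float]) -> List[float]:
    """Compute softmax(W h): one probability per privacy information category."""
    scores = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(scores)


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # k = 3 categories, 2-dimensional vectors
h_wi = [0.5, 2.0]
probs = classify(W, h_wi)
```

The q-th entry of `probs` plays the role of $p(c_q \mid w_i)$ for the q-th category.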

In one example, the text to be parsed is privacy statement text of an application; the plurality of privacy information categories include: a non-privacy category free of privacy information and privacy statement compliance information, and a number of privacy categories corresponding to a number of preset categories of privacy statement compliance information.

Further, the number of preset categories of privacy statement compliance information includes at least one of:

privacy information storage period, privacy information expiration handling method, privacy information storage region, complaint and feedback channels, basic information about the application operator, and contact details of the person responsible for privacy information protection.

It is to be understood that, in addition to stipulating the categories of privacy information that an application may collect, laws and regulations may also stipulate, for example, that the privacy statement text should include at least one item of privacy statement compliance information. Table two is a correspondence table between the privacy statement compliance information and the privacy information categories provided in the embodiments of this specification.

Table two: correspondence table of privacy statement compliance information and privacy information category

It can be understood that general privacy information categories include the privacy categories corresponding to specific privacy information, such as the personal basic information and personal identity information listed in Table one. On this basis, the embodiments of this specification may further include privacy categories corresponding to privacy statement compliance information, such as the privacy information storage period and the privacy information expiration handling method. This makes the parsing of the privacy statement text more comprehensive and, in turn, makes the subsequent compliance check based on the parsing result more comprehensive.

Then, in step 25, the privacy information category corresponding to the maximum probability among the probabilities is determined as the attribution category of the corresponding word. It will be appreciated that the greater the probability that a word corresponds to a certain category of private information, the more likely the word belongs to that category of private information.

For example, suppose the privacy information categories divided in advance are privacy information category 1, privacy information category 2, and privacy information category 3, and the probabilities that word 1 corresponds to these categories are p1, p2 and p3 respectively. If p1 < p2 < p3, privacy information category 3 is determined as the attribution category of word 1.
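The selection in step 25 amounts to an argmax over the category probabilities. A minimal sketch follows; the function name `attribution_category` and the probability values are hypothetical.

```python
from typing import List, Sequence


def attribution_category(probabilities: Sequence[float], categories: Sequence[str]) -> str:
    """Return the category whose probability is the largest."""
    if len(probabilities) != len(categories):
        raise ValueError("one probability is required per category")
    best = max(range(len(probabilities)), key=lambda q: probabilities[q])
    return categories[best]


categories: List[str] = [
    "privacy information category 1",
    "privacy information category 2",
    "privacy information category 3",
]
p1, p2, p3 = 0.1, 0.3, 0.6  # made-up values with p1 < p2 < p3
chosen = attribution_category([p1, p2, p3], categories)
```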

Finally, in step 26, the parsing result of the text to be parsed is determined according to the attribution type of the word and the position of the word in the word sequence. It can be understood that the attribution type of the word and the position of the word in the word sequence may be directly used as the analysis result of the text to be analyzed, or the attribution type of the word and the position of the word in the word sequence may be used as an intermediate result, and the analysis result of the text to be analyzed is obtained after the intermediate result is continuously analyzed and processed.

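In one example, the determining a parsing result of the text to be parsed according to the attribution type of a word and the position of the word in the word sequence includes:

checking, according to the attribution categories of the words and the positions of the words in the word sequence, whether a plurality of words at adjacent positions in the text to be parsed have the same attribution category;

and merging a plurality of words at adjacent positions that have the same attribution category into a result unit, and determining the attribution category corresponding to the result unit and its position in the word sequence as the parsing result of the text to be parsed.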

FIG. 3 illustrates a parsing process diagram for a text example, according to one embodiment. Referring to fig. 3, the privacy information categories involved in the figure are O (other, i.e. without target information), BI (personal basic data, which is one kind of target information), and III (network identification information, which is one kind of target information). It can be understood that the target information is the information that needs to be extracted from the text to be parsed, and includes privacy information and privacy statement compliance information. A target sentence in the text to be parsed first passes through a transfer learning model to obtain a word sequence formed by the segmented words; each word in the word sequence is then classified using a deep learning model; finally, all the privacy information and privacy statement compliance information involved in the text to be parsed, together with their positions in the word sequence, are obtained as the parsing result. For example, in fig. 3 the privacy information category to which "mobile phone number" belongs is BI, and its position is the 15th element of the word sequence after word segmentation.

FIG. 4 illustrates a parsing process diagram for a text example, according to another embodiment. Referring to fig. 4, the privacy information categories involved in the figure include O (other information, that is, not including target information), STP (personal information storage period, which is one kind of target information), and ODP (personal information expiration handling method, which is one kind of target information). A target sentence in the text to be parsed first passes through a transfer learning model to obtain a word sequence formed by the segmented words. Each word in the word sequence is then classified using a deep learning model, yielding as an intermediate result all the privacy information and privacy statement compliance information involved in the text to be parsed, together with their positions in the word sequence. Fig. 4 shows that when a piece of privacy information or privacy statement compliance information consists of several words, the deep learning model predicts those words as the same privacy information category; adjacent words of the same privacy information category (that is, any category other than O) are then merged to obtain the final parsing result. For example, "not less" and "six months" in fig. 4 are merged into "not less than six months", giving the complete personal information storage period.
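The merging of adjacent words that share a non-O category can be sketched as a single pass over the word sequence. The function name `merge_adjacent`, the dictionary layout of a result unit, and the example tokens and labels are assumptions for illustration; the example only loosely mirrors the storage-period case of fig. 4.

```python
from typing import Dict, List, Sequence, Union

Unit = Dict[str, Union[str, int]]


def merge_adjacent(
    words: Sequence[str],
    categories: Sequence[str],
    other: str = "O",
    joiner: str = " ",
) -> List[Unit]:
    """Merge runs of adjacent words that share the same non-'other' category."""
    units: List[Unit] = []
    for pos, (word, cat) in enumerate(zip(words, categories)):
        if cat == other:
            continue  # words without target information are skipped
        last = units[-1] if units else None
        if last and last["category"] == cat and last["end"] == pos - 1:
            last["end"] = pos                      # extend the current unit
            last["text"] = f'{last["text"]}{joiner}{word}'
        else:
            units.append({"category": cat, "start": pos, "end": pos, "text": word})
    return units


words = ["retained", "for", "not less than", "six months", ",", "then", "deleted"]
labels = ["O", "O", "STP", "STP", "O", "O", "ODP"]
units = merge_adjacent(words, labels)
```

Each result unit keeps its category together with its start and end positions in the word sequence. For Chinese text, `joiner=""` would be the natural choice since words are not separated by spaces.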

In the embodiment of the specification, a multi-classification model is used for text analysis, and the multi-classification model automatically judges which kinds of sensitive information are declared in a privacy declaration text of an application program and the position of the corresponding sensitive information by means of deep learning, transfer learning and the like, wherein the sensitive information is the privacy information or privacy declaration compliance information.

Multi-class classification is a kind of supervised learning whose main objective is to determine, from features of known samples, which known class a new sample belongs to. Specifically, a multi-classification model computes and selects feature parameters from the sample data of a known training set and creates a discriminant function with which to classify samples.

Supervised learning is a machine learning method that classifies or fits input data given previously labeled training examples.

Deep learning is a branch of machine learning; it is a class of algorithms that perform representation learning on data using artificial neural networks as the framework.

FIG. 5 illustrates a multi-classification model structure diagram according to one embodiment. Referring to fig. 5, the multi-classification model is a word sequence classification model based on word granularity. First, a sentence in the text to be parsed is input into a transfer learning model, which segments the sentence into a word sequence composed of a plurality of words; for example, the word sequence in the figure includes word 1, …, word m. The word sequence is then input into the coding layer of a deep learning model, which encodes each word to obtain the corresponding word vectors; for example, word 1 corresponds to word vector 1, …, and word m corresponds to word vector m. Each word vector is then input into the classification layer of the deep learning model, which outputs the probability that each word in the word sequence corresponds to each category; for example, the probability that word 1 corresponds to category 1, …, and the probability that word 1 corresponds to category m. The predicted category of each word is obtained from these probabilities, and adjacent words with the same predicted category are merged to obtain the sensitive information and its position in the sentence, where the sensitive information can be privacy information or privacy statement compliance information.

The method provided by the embodiment of the present specification can be further combined with other text parsing manners, for example, privacy statement compliance information with a prominent format characteristic, such as complaints and feedback channels (email, phone, address) can be obtained by using a regular expression matching manner.
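Regular expression matching for complaint and feedback channels might look like the following sketch. Both patterns are assumptions: the email pattern is a common simplified form rather than a full address validator, and the phone pattern assumes an area code, a hyphen, and a 7- or 8-digit number. The address `privacy@example.com` and the phone number are invented sample data.

```python
import re
from typing import Dict, List

# Simplified, assumed patterns for strongly formatted contact channels.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3,4}-\d{7,8}\b")


def find_channels(text: str) -> Dict[str, List[str]]:
    """Collect email addresses and phone numbers that appear in the text."""
    return {
        "email": EMAIL.findall(text),
        "phone": PHONE.findall(text),
    }


statement = "For complaints, email privacy@example.com or call 0571-12345678."
channels = find_channels(statement)
```

If character positions are also needed, `re.finditer` returns match objects whose `start()` and `end()` give the span of each hit.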

The embodiments of this specification use an end-to-end scheme: blank sentences do not need to be pre-judged, so the method is convenient to use and efficient at identification. Because it works at word granularity, the method yields both the privacy information categories contained in the text and the positions where the privacy information appears, giving a fine-grained result.

In one example, the text to be parsed is privacy statement text of an application;

In an embodiment of the present specification, the privacy compliance check for an application mainly includes the following: analyzing laws and regulations to form a mapping between application categories and the privacy information that the laws and regulations allow to be collected; analyzing the privacy statement text of the application to extract the privacy information that the statement declares as collected; analyzing the code of the application to extract the privacy information actually collected by the code; and integrating the extracted information to judge whether the application collects privacy information in violation of the regulations.

FIG. 6 illustrates an overall architecture diagram of privacy compliance checking according to one embodiment. Referring to FIG. 6, the decision module takes analysis data from three sources as input: the privacy information declared as collected, extracted from the App privacy statement text; the privacy information actually collected, indicated by the App code analysis result; and the privacy information allowed to be collected, indicated by the analysis result of laws and regulations. It compares the three to produce a compliance report. The embodiments of this specification mainly provide a solution for parsing the privacy statement text. The text parsing is implemented with a multi-classification model and involves two stages. In the training stage, the multi-classification model is trained on data labeled in advance. In the testing stage, given the privacy statement text of an App, the text is first split into sentences, the resulting sentences are input in turn into the trained multi-classification model for prediction, and the predictions are collected into a data set of privacy information declared as collected. This data set holds all the privacy information that the App privacy statement text declares as collected, together with the corresponding positions.
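The testing stage can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the punctuation-based sentence splitter is a naive assumption, and `predict` is a hypothetical stand-in for the trained multi-classification model, assumed to return a list of (category, position) pairs for one sentence.

```python
import re


def split_sentences(text):
    """Split on common Chinese and English sentence-ending punctuation (naive rule)."""
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [part for part in parts if part.strip()]


def collect_declared_privacy(text, predict):
    """Run predict(sentence) on each sentence and gather the findings.

    `predict` is a hypothetical callable standing in for the trained model.
    """
    declared = []
    for index, sentence in enumerate(split_sentences(text)):
        for category, position in predict(sentence):
            declared.append(
                {"sentence": index, "category": category, "position": position}
            )
    return declared
```

The returned list plays the role of the data set of privacy information declared as collected: each record keeps the category together with the sentence index and the position within that sentence.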

According to the method provided by the embodiments of this specification, a text to be parsed is first obtained; word segmentation processing is then performed on the text to be parsed to obtain a word sequence containing a plurality of words; context-based coding is then performed on the word sequence to obtain word vectors corresponding to the words respectively; the probabilities that each corresponding word belongs to a plurality of privacy information categories are determined according to the word vectors; the privacy information category corresponding to the maximum probability is then determined as the attribution category of the corresponding word; and finally, the parsing result of the text to be parsed is determined according to the attribution categories of the words and the positions of the words in the word sequence. As can be seen from the above, in the embodiments of this specification, word segmentation is performed on the text to be parsed and the attribution category of each word is then determined. This yields not only the privacy information categories of the words contained in the text but also the positions at which words of each privacy information category appear in the text. These positions arise naturally from the word sequence obtained by word segmentation, so no manually labeled position data is needed to train the model, and the parsing effect of the text can be improved.

According to an embodiment of another aspect, an apparatus for automatically parsing private information in text is further provided, where the apparatus is configured to perform the method for automatically parsing private information in text provided by the embodiments of this specification. FIG. 7 shows a schematic block diagram of an apparatus for automatically parsing private information in text, according to one embodiment. As shown in fig. 7, the apparatus 700 includes:

an obtaining unit 71, configured to obtain a text to be parsed;

a word segmentation unit 72, configured to perform word segmentation processing on the text to be parsed obtained by the obtaining unit 71, so as to obtain a word sequence including a plurality of words;

a coding unit 73, configured to perform context-based coding on the word sequence obtained by the word segmentation unit 72, so as to obtain word vectors corresponding to the words respectively;

a probability determining unit 74, configured to determine, according to the word vector obtained by the encoding unit 73, probabilities that words corresponding to the word vector respectively belong to multiple privacy information categories;

a category determining unit 75, configured to determine, as an attribution category of the corresponding word, a privacy information category corresponding to a maximum probability among the probabilities obtained by the probability determining unit 74;

a result determining unit 76, configured to determine a parsing result of the text to be parsed according to the attribution category of the word obtained by the category determining unit 75 and the position of the word in the word sequence.

Optionally, as an embodiment, the word segmentation unit 72 includes:

the sentence splitting subunit is used for splitting the text to be parsed into a plurality of sentences;

and the word segmentation subunit is used for taking any one of the plurality of sentences obtained by the sentence splitting subunit as a target sentence, inputting the target sentence into a transfer learning model, and performing word segmentation processing on the target sentence through the transfer learning model to obtain a word sequence comprising a plurality of words.

Optionally, as an embodiment, the encoding unit 73 is specifically configured to input the word sequence into an encoding layer of a deep learning model, and perform context-based encoding on the word sequence through the encoding layer to obtain word vectors corresponding to the words respectively.

Further, the probability determining unit 74 is specifically configured to input the word vector into a classification layer of the deep learning model, and output, through the classification layer, probabilities that words corresponding to the word vector belong to multiple privacy information categories, respectively.

Optionally, as an embodiment, the result determining unit 76 includes:

the checking subunit is used for checking, according to the attribution categories of the words and the positions of the words in the word sequence, whether a plurality of words at adjacent positions in the text to be parsed belong to the same attribution category;

and the merging subunit is used for merging the plurality of words at adjacent positions that the checking subunit found to belong to the same attribution category into a result unit, and determining the attribution category corresponding to the result unit and the position of the result unit in the word sequence as the parsing result of the text to be parsed.
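The checking and merging subunits can be illustrated with a short sketch. It assumes that words carrying no privacy information are assigned a background category (named `"other"` here, a hypothetical label) and that a merged unit is reported with the positions of its first and last word; the `joiner` parameter is also an assumption, since Chinese words would be joined without a separator.

```python
from itertools import groupby


def merge_adjacent(words, categories, background="other", joiner=" "):
    """Merge runs of adjacent words sharing a category into (text, category, start, end)."""
    units = []
    index = 0
    # groupby yields one group per run of consecutive equal categories.
    for category, run in groupby(zip(words, categories), key=lambda pair: pair[1]):
        run_words = [word for word, _ in run]
        start, end = index, index + len(run_words) - 1
        index = end + 1
        if category != background:  # skip words that carry no privacy information
            units.append((joiner.join(run_words), category, start, end))
    return units
```

For a word sequence such as "we collect phone number and home address" with per-word categories, the two runs "phone number" and "home address" each become a single result unit with its category and word positions.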

Optionally, as an embodiment, the text to be parsed is a privacy statement text of the application program;

the device further comprises:

a result obtaining unit, configured to obtain a code analysis result of the application program after the result determining unit 76 determines the analysis result of the text to be analyzed, where the code analysis result indicates a first category set formed by privacy information categories actually collected by the application program;

the set determining unit is used for determining a second category set formed by the privacy information categories collected by the privacy statement text statement according to the analysis result of the text to be analyzed;

and a compliance determining unit, configured to determine that the application program is compliant when the first category set obtained by the result obtaining unit is consistent with the second category set obtained by the set determining unit, and all privacy information categories included therein belong to the privacy information categories that laws and regulations allow the application program to collect.
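The compliance condition stated above reduces to two set comparisons, which the following minimal sketch makes explicit. The function name `is_compliant` and the example category labels are hypothetical; the three inputs correspond to the first category set (actually collected), the second category set (declared in the privacy statement), and the categories allowed by laws and regulations.

```python
def is_compliant(collected, declared, allowed):
    """Return True when collected == declared and every category is allowed."""
    collected, declared, allowed = set(collected), set(declared), set(allowed)
    sets_match = collected == declared      # first set is consistent with second set
    all_allowed = collected <= allowed      # every category is permitted by regulation
    return sets_match and all_allowed
```

For example, an application that collects and declares `{"phone_number"}` passes when `"phone_number"` is allowed, but fails if its code also collects `"location"` without declaring it, or if it collects and declares a category that is not allowed.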

With the apparatus provided in the embodiments of this specification, the obtaining unit 71 first obtains a text to be parsed; the word segmentation unit 72 then performs word segmentation processing on the text to be parsed to obtain a word sequence including a plurality of words; the coding unit 73 then performs context-based coding on the word sequence to obtain word vectors corresponding to the words; the probability determining unit 74 determines, according to the word vectors, the probabilities that each corresponding word belongs to a plurality of privacy information categories; the category determining unit 75 then determines the privacy information category corresponding to the maximum probability as the attribution category of the corresponding word; and finally, the result determining unit 76 determines the parsing result of the text to be parsed according to the attribution categories of the words and the positions of the words in the word sequence. As can be seen from the above, in the embodiments of this specification, word segmentation is performed on the text to be parsed and the attribution category of each word is then determined. This yields not only the privacy information categories of the words contained in the text but also the positions at which words of each privacy information category appear in the text. These positions arise naturally from the word sequence obtained by word segmentation, so no manually labeled position data is needed to train the model, and the parsing effect of the text can be improved.

According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.

According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above-mentioned embodiments further describe in detail the objects, technical solutions, and advantages of the present invention. It should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims

1. A method of automatically parsing private information in text, the method comprising:

acquiring a text to be analyzed;

2. The method of claim 1, wherein the performing word segmentation processing on the text to be parsed comprises:

splitting the text to be analyzed into a plurality of sentences;

3. The method of claim 1, wherein the context-based encoding of the sequence of words comprises:

4. The method of claim 3, wherein the determining, from the word vector, respective probabilities that their corresponding words respectively belong to a plurality of privacy information categories comprises:

5. The method of claim 1, wherein the determining the parsing result of the text to be parsed according to the attribution type of a word and the position of the word in the word sequence comprises:

6. The method of claim 1, wherein the text to be parsed is privacy statement text of an application;

7. The method of claim 6, wherein the number of preset categories of privacy statement compliance information includes at least one of:

8. The method of claim 1, wherein the text to be parsed is privacy statement text of an application;

9. An apparatus for automatically parsing private information in text, the apparatus comprising:

the acquisition unit is used for acquiring a text to be analyzed;

10. The apparatus of claim 9, wherein the word segmentation unit comprises:

11. The apparatus according to claim 9, wherein the encoding unit is specifically configured to input the word sequence into an encoding layer of a deep learning model, and perform context-based encoding on the word sequence through the encoding layer to obtain word vectors corresponding to the words, respectively.

12. The apparatus according to claim 11, wherein the probability determining unit is specifically configured to input the word vector into a classification layer of the deep learning model, and output, through the classification layer, probabilities that corresponding words respectively belong to a plurality of privacy information categories.

13. The apparatus of claim 9, wherein the result determination unit comprises:

14. The apparatus of claim 9, wherein the text to be parsed is privacy statement text of an application;

15. The apparatus of claim 14, wherein the number of preset categories of privacy statement compliance information comprises at least one of:

16. The apparatus of claim 9, wherein the text to be parsed is privacy statement text of an application;

the device further comprises:

the result obtaining unit is used for obtaining a code analysis result of the application program after the result determining unit determines the analysis result of the text to be analyzed, and the code analysis result indicates a first category set formed by privacy information categories actually collected by the application program;

17. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-8.

18. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-8.