CN117272982A

CN117272982A - Protocol text detection method and device based on large language model

Info

Publication number: CN117272982A
Application number: CN202311206343.9A
Authority: CN
Inventors: 鲍梦瑶; 刘佳伟; 章鹏; 张谦; 杨仁慧
Original assignee: Ant Blockchain Technology Shanghai Co Ltd
Current assignee: Ant Blockchain Technology Shanghai Co Ltd
Priority date: 2023-09-18
Filing date: 2023-09-18
Publication date: 2023-12-22

Abstract

The embodiments of the present specification provide a protocol text detection method and device based on a large language model. The method comprises the following steps: determining, from a plurality of preset elements, a target element corresponding to a target paragraph in a target protocol text (such as a protocol text related to privacy data), wherein any element of the plurality of elements is a question related to protocol text and is preset with a prompter template corresponding to the element; generating a target prompter based on the target paragraph and the prompter template corresponding to the target element; and inputting the target prompter into the large language model, so that the large language model performs reasoning related to the target element and outputs a reasoning result.

Description

Protocol text detection method and device based on large language model

Technical Field

The embodiment of the specification belongs to the technical field of computers, and particularly relates to a protocol text detection method and device based on a large language model.

Background

A data asset may refer to a data resource, such as file materials or electronic data, that is owned or controlled by an enterprise and that is capable of bringing future economic benefits to the enterprise. Currently, enterprises may use and process data assets, such as data assets that involve personal privacy data, and may also sign agreements related to the data assets with individuals, other enterprises, and the like.

Disclosure of Invention

The invention aims to provide a protocol text detection scheme based on a large language model, which can realize automatic analysis of protocol text in a few-sample/zero-sample scene by utilizing the generalization capability of the large language model and remarkably reduce the analysis cost of the protocol text.

The first aspect of the present specification provides a method for detecting protocol text based on a large language model, including: determining, from a plurality of preset elements, a target element corresponding to a target paragraph in the target protocol text, wherein any element of the plurality of elements is a question related to protocol text and is preset with a prompter template corresponding to the element; generating a target prompter based on the target paragraph and the prompter template corresponding to the target element; and inputting the target prompter into the large language model, so that the large language model performs reasoning related to the target element and outputs a reasoning result.

A second aspect of the present specification provides a large language model-based protocol text detection apparatus, including: a determining unit configured to determine a target element corresponding to a target paragraph in the target protocol text from a plurality of preset elements; any element of the plurality of elements is a question related to a protocol text and is preset with a prompter template corresponding to the element; a generating unit configured to generate a target prompter based on the target paragraph and a prompter template corresponding to the target element; an inference unit configured to input the target prompter into the large language model, cause the large language model to perform inference related to the target element, and output an inference result.

A third aspect of the present description provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method as described in the first aspect.

A fourth aspect of the present description provides a computing device comprising a memory having executable code stored therein and a processor which when executing the executable code implements the method as described in the first aspect.

A fifth aspect of the present description provides a computer program product which, when executed in a computer, causes the computer to perform the method as described in the first aspect.

In the solution provided in the embodiments of the present specification, a target element corresponding to a target paragraph in a target protocol text may be determined from a plurality of preset elements; any element of the plurality of elements is a question related to protocol text, and a prompter template corresponding to the element is preset. Then, a target prompter can be generated based on the target paragraph and the prompter template corresponding to the target element, and the target prompter is input into the large language model, so that the large language model performs reasoning related to the target element and outputs a reasoning result. In this way, the prompter that is input into the large language model is generated from the prompter template corresponding to the element, and the generalization capability of the large language model is used to analyze protocol text automatically in a few-sample/zero-sample scenario, thereby significantly reducing the analysis cost of the protocol text.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present disclosure, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of one application scenario in which embodiments of the present description may be applied;

FIG. 2 is a schematic diagram of a segmentation process of target protocol text in an embodiment of the present disclosure;

FIG. 3 is a flow chart of a large language model based protocol text detection method in an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a target prompter in an embodiment of the present description;

FIG. 5 is a schematic diagram of a manual audit and feedback mechanism in an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a process for detecting operational behavior data in an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a large language model based protocol text detection apparatus in an embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.

As previously described, an enterprise may use and process data assets, such as data assets related to personal privacy data information, and may also sign agreements related to the data assets with individuals or other enterprises, etc.

In practice, the protocol text can be detected according to actual requirements. For example, the detection of necessary terms may be performed on the agreement text in case the agreement text lacks a definition of certain necessary terms.

In addition, data is a basic national strategic resource and is the core and lifeblood of the big data industry. To standardize data generation, collection, storage, processing, analysis, services, and the like, a number of laws, regulations, and policy documents have been issued in China. Together, these laws, regulations, and policy documents form China's data governance legislative framework, jointly safeguard network security and data security, promote the development of the big data industry, activate the potential of data elements, and accelerate changes in the quality, efficiency, and driving forces of economic and social development. Enterprises should therefore pay closer attention to how to collect, process, and apply personal information lawfully and properly. To help enterprises achieve legal compliance over the full life cycle of data assets and avoid related legal risks, to help regulators address violations of laws and regulations, and to ensure that agreement content meets the relevant legal requirements, compliance detection can be performed on protocol texts.

Traditional protocol text detection methods typically rely on manually written rules or on machine learning models trained on labeled data. These methods have limitations: rules are difficult to write, costly to maintain, and generalize poorly, while labeled data is difficult and costly to acquire and covers only a limited range of cases.

A large language model (LLM) may refer to a deep neural network model that is trained on large amounts of text data and is capable of generating natural language text from context. A large language model can perform few-sample/zero-sample (few-shot/zero-shot) learning. Few-sample learning may refer to training a model with a small amount of labeled data so that it can adapt to new tasks or categories. Zero-sample learning may refer to training a model with additional auxiliary information so that it can identify unseen categories without any labeled data.

Considering the powerful generalization capability of a large language model in a few-sample/zero-sample scenario, the embodiments of the present specification provide a protocol text detection scheme based on a large language model. The scheme uses the generalization capability of the large language model to analyze protocol text automatically in a few-sample/zero-sample scenario, so that the analysis cost of the protocol text is significantly reduced.

Fig. 1 is a schematic diagram of one application scenario to which the embodiments of the present description may be applied. In the application scenario shown in fig. 1, a protocol text detection engine 101 may be included. The protocol text detection engine 101 may be used to automatically parse protocol text.

The protocol text detection engine 101 may store the large language model M, a plurality of elements related to protocol text, and the prompter template corresponding to each of the plurality of elements. In one example, the large language model M may be a large language model with billions of parameters. The large language model M may accept a customized prompter as input and output a text parsing result, i.e., the inference result. In practice, a prompter (prompt) may refer to additional text added to the input so as to make better use of the knowledge of the pre-trained language model.

In the embodiments of the present specification, a single protocol text may include, but is not limited to, a protocol text related to private data, for example a privacy authorization agreement or a privacy policy. Each of the plurality of elements may be a question related to protocol text. The prompter template corresponding to each element may be designed by a prompter designer using prompt engineering techniques. The prompter template corresponding to an element may be used to indicate that reasoning related to the element is to be performed. In one example, the prompter template may include, but is not limited to, an instruction that includes the element and a plurality of alternative answers. Further, the prompter template may also include at least one of: an example, a slot marker for characterizing a paragraph, and the like. The example serves as a reference for the large language model M and can guide the large language model M to execute the instruction more correctly.

Note that, in a scenario where compliance detection of the protocol text is required, each element related to the protocol text stored in the protocol text detection engine 101 may be a compliance element. It is understood that the compliance element is an element for compliance detection.

The above-described elements stored in the protocol text detection engine 101 may be designed by industry experts according to laws, regulations, and related rules. In one example, the embodiments of the present specification may relate to a number of privacy data usage scenarios, and a plurality of elements may be designed for each of these scenarios. A single privacy data usage scenario may be, for example, any of the following: a private data collection scenario, a private data transmission scenario, a private data storage scenario, a private data sharing scenario, and the like. Taking the private data collection scenario as an example, when compliance detection needs to be performed on a protocol text in that scenario, the plurality of compliance elements designed for the scenario may include, for example, whether the text declares collection of a personal name, whether it declares collection of a personal identity, whether it contains an unreasonable disclaimer, and so on.

In one embodiment, each alternative answer in the instruction described above may correspond to a label. Consider the case where a single prompter template includes an example, an instruction, and a slot marker for characterizing a paragraph, and take the compliance element "whether there is an unreasonable disclaimer" as an example. The prompter template P corresponding to this compliance element may be, for example: "Common unreasonable disclaimer phrases are: the platform does not bear any responsibility, the user independently bears any responsibility, and the like. Based on this, please determine whether there is an unreasonable disclaimer in the following paragraph, and answer only 'include' or 'do not include'. [corresponding paragraph]". Here, the sentence beginning "Common unreasonable disclaimer phrases are:" is the example; the sentence beginning "Based on this, please determine" is the instruction; "include" and "do not include" in the instruction are the alternative answers; and "[corresponding paragraph]" is the slot marker used for characterizing the paragraph. The label corresponding to the alternative answer "include" may be, for example, "yes", and the label corresponding to the alternative answer "do not include" may be, for example, "no". It should be understood that the example and the instruction in the prompter template, and the labels corresponding to the respective alternative answers in the instruction, may be set according to actual requirements and are not specifically limited herein.
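The structure of prompter template P can be sketched as a small data structure. This is only an illustrative sketch: the class name `PrompterTemplate`, its field names, and the exact English wording of the strings are assumptions made here and are not specified in the source.

```python
from dataclasses import dataclass
from typing import Dict

# Slot marker used for characterizing the paragraph (literal string is an assumption).
SLOT = "[corresponding paragraph]"


@dataclass
class PrompterTemplate:
    """Hypothetical container for one element's prompter template."""
    element: str                   # the question the template reasons about
    example: str                   # reference example for the model
    instruction: str               # instruction containing the alternative answers
    answer_labels: Dict[str, str]  # alternative answer -> label
    slot: str = SLOT

    def text(self) -> str:
        # Example, then instruction, then the slot marker for the paragraph.
        return f"{self.example} {self.instruction} {self.slot}"


TEMPLATE_P = PrompterTemplate(
    element="whether there is an unreasonable disclaimer",
    example=(
        "Common unreasonable disclaimer phrases are: the platform does not bear "
        "any responsibility, the user independently bears any responsibility, "
        "and the like."
    ),
    instruction=(
        "Based on this, please determine whether there is an unreasonable "
        "disclaimer in the following paragraph, and answer only 'include' or "
        "'do not include'."
    ),
    answer_labels={"include": "yes", "do not include": "no"},
)
```

Keeping the answer-to-label mapping next to the instruction makes it harder for the alternative answers and their labels to drift apart when a template is revised.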

For the target protocol text to be detected, the protocol text detection engine 101 or other engines may segment the target protocol text to obtain a target paragraph in the target protocol text before detecting the target protocol text. The protocol text detection engine 101 may then detect for the target passage in the target protocol text.

When the target protocol text is segmented, the target protocol text can be structurally analyzed through regular expressions and a context inference technology, and main section paragraphs and corresponding section titles of the target protocol text are analyzed.

In particular, reference is made to fig. 2, which is a schematic diagram of a segmentation process of target protocol text in an embodiment of the present disclosure.

As shown in fig. 2, in step S201, a preset regular expression for detecting a plurality of digital serial number identifiers is used to detect a digital serial number identifier of a target protocol text, and the detected digital serial number identifier and a title text corresponding to the digital serial number identifier are extracted from the target protocol text.

The regular expression described above can be used to detect a variety of numerical sequence number identifiers, such as Arabic numerals (e.g., "1") and numerals written as words (e.g., "one"). When numerical sequence number identifier detection is performed on the target protocol text, the target protocol text can first be divided by sentence or by line break so that the text content is processed line by line. After the divided sub-texts are obtained, the regular expression can be applied to each sub-text to detect its numerical sequence number identifier, and the detected identifier and the title text corresponding to it are extracted.
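A minimal sketch of step S201 is shown below. The source does not disclose the actual regular expression, so the pattern `SERIAL_RE`, the three identifier styles it recognizes (Arabic numerals, Chinese numerals, and parenthesized Chinese numerals), and the function name `extract_headings` are all assumptions chosen for illustration.

```python
import re

# Chinese numerals one to ten (assumed identifier alphabet for this sketch).
CN = "一二三四五六七八九十"

# Hypothetical pattern: an identifier at the start of a line, then the title text.
#   "1." or "1、"      Arabic numeral (not followed by another digit, to skip "1.5")
#   "一、"             Chinese numeral
#   "(二)" or "（二）"  parenthesized Chinese numeral
SERIAL_RE = re.compile(
    rf"^\s*(?P<id>\d+[.、](?!\d)|[{CN}]+、|[(（][{CN}]+[)）])\s*(?P<title>.*)$"
)


def extract_headings(text):
    """Return (line_no, identifier, title_text) for each line that starts with an identifier."""
    headings = []
    for line_no, line in enumerate(text.splitlines()):
        match = SERIAL_RE.match(line)
        if match:
            headings.append((line_no, match.group("id"), match.group("title").strip()))
    return headings


sample = "一、Scope\nbody text\n(二) Collection\n1. Name\nplain line"
headings = extract_headings(sample)
```

Lines without a leading identifier are simply skipped, so body text stays attached to whatever heading precedes it.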

In step S203, the extracted digital serial number identifiers are classified.

Specifically, the extracted numerical sequence number identifiers may be classified according to their type. For example, "1" and "2" belong to one category, while "(two)" and "(three)" belong to another category.
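The classification in step S203 can be sketched as a lookup of each identifier against one pattern per style. The category names and patterns below are assumptions for illustration; the source only states that identifiers of the same type fall into the same category.

```python
import re

# Hypothetical identifier styles; one compiled pattern per category.
STYLE_PATTERNS = [
    ("arabic", re.compile(r"^\d+[.、]$")),
    ("chinese", re.compile(r"^[一二三四五六七八九十]+、$")),
    ("paren_chinese", re.compile(r"^[(（][一二三四五六七八九十]+[)）]$")),
]


def classify_identifier(identifier):
    """Return the category name of an identifier, or "unknown" if no style matches."""
    for name, pattern in STYLE_PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"
```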

In step S205, the level of each extracted title text is determined based on the order in which each type of numerical sequence number identifier appears in the target protocol text.

A single level may be, for example, primary, secondary, or tertiary. Title texts of different levels may have the following characteristics: a secondary title text occurs after a primary title text, a tertiary title text occurs after a secondary title text, and a new primary title text may occur after a secondary or tertiary title text.
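One simple way to realize step S205 is to give each identifier category the level at which it is first seen: the first category encountered is primary, the next new category is secondary, and so on. This first-appearance rule is an assumption made for this sketch; the source only says that levels are derived from the order of appearance of each identifier type.

```python
def assign_levels(categories):
    """Map each heading's category to a 1-based level by order of first appearance.

    categories: category names of the detected headings, in document order.
    Returns a list of levels aligned with the input.
    """
    level_of = {}
    levels = []
    for category in categories:
        if category not in level_of:
            # A category not seen before is one level deeper than all seen so far.
            level_of[category] = len(level_of) + 1
        levels.append(level_of[category])
    return levels


levels = assign_levels(["chinese", "paren_chinese", "arabic", "paren_chinese", "chinese"])
```

Under this rule a primary title can reappear after secondary or tertiary titles, which matches the characteristics described above.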

In step S207, the target protocol text is segmented based on the level corresponding to each title text.

As an example, for any two adjacent primary titles in the target protocol text, the earlier primary title and the other text located between the two primary titles may together form a target paragraph.
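The segmentation in step S207 can be sketched as slicing the text between consecutive primary titles. The function name and input shapes are assumptions; in addition, treating the last primary title as running to the end of the text, and dropping any text before the first primary title, are choices made for this sketch and are not stated in the source.

```python
def split_by_primary(lines, headings):
    """Split text lines into target paragraphs at primary titles.

    lines:    the text as a list of lines.
    headings: (line_no, level) pairs for the detected titles, in document order.
    """
    starts = [line_no for line_no, level in headings if level == 1]
    paragraphs = []
    for i, start in enumerate(starts):
        # A paragraph runs up to the next primary title, or to the end of the text.
        end = starts[i + 1] if i + 1 < len(starts) else len(lines)
        paragraphs.append("\n".join(lines[start:end]))
    return paragraphs


doc_lines = ["一、Scope", "text a", "(一) Sub", "text b", "二、Disclaimer", "text c"]
paragraphs = split_by_primary(doc_lines, [(0, 1), (2, 2), (4, 1)])
```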

The scheme provided by the embodiment corresponding to FIG. 2 is efficient and flexible. Because a regular expression is used to match and analyze the numerical sequence number identifiers automatically, the scheme can adapt to a variety of text formats and styles and greatly reduces the time and cost of manual analysis. Analyzing the contextual relationship among the numerical sequence number identifiers helps ensure that the title of each level is determined correctly. In addition, the scheme can be extended to more text formats by modifying the regular expression.

Fig. 3 is a flowchart of a protocol text detection method based on a large language model in the embodiment of the present specification. The method may be performed by the protocol text detection engine 101.

As shown in fig. 3, first, in step S301, a target element corresponding to a target paragraph in a target protocol text is determined from a plurality of preset elements; any element of the plurality of elements is a question related to a protocol text, and a prompter template corresponding to the element is preset.

In the case where compliance detection is required for the target protocol text, the plurality of elements may be a plurality of compliance elements. In one example, the target protocol text may be attributed to a certain one of a number of privacy data usage scenarios that are preset, for which the plurality of elements may be configured. The single private data usage scenario may be any one of the following: a private data collection scene, a private data transmission scene, a private data storage scene, a private data sharing scene and the like.

The prompter template corresponding to an element may be used to indicate that inferences are to be made regarding the element. In one example, the prompter template may include, for example, but is not limited to, instructions that include the element and a plurality of alternative answers. Further, the prompter template may further include at least one of: examples, slot markers for characterizing paragraphs, and the like. For a detailed explanation of the prompter template, reference may be made to the relevant description hereinbefore and will not be repeated here.

In one embodiment, each of the plurality of elements may correspond to a keyword, and the keyword may be contained in its corresponding element. Taking the compliance element "whether there is an unreasonable disclaimer" as an example, the corresponding keyword may be, for example, "disclaimer". Taking the compliance element "whether to declare to collect personal name" as an example, the corresponding keyword may be, for example, "personal name". For any element of the plurality of elements, if the keyword corresponding to the element appears in the paragraph content of the target paragraph, the element may be determined as a target element corresponding to the target paragraph.
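The keyword-based selection of target elements can be sketched as a substring test per element. The dictionary `ELEMENT_KEYWORDS` holds the two examples given above; the function name and the case-insensitive comparison are assumptions of this sketch.

```python
# Element (question) -> keyword, using the two examples from the text.
ELEMENT_KEYWORDS = {
    "whether there is an unreasonable disclaimer": "disclaimer",
    "whether to declare to collect personal name": "personal name",
}


def find_target_elements(paragraph, element_keywords=ELEMENT_KEYWORDS):
    """Return every element whose keyword occurs in the paragraph content."""
    text = paragraph.lower()  # case-insensitive matching is an assumption
    return [
        element
        for element, keyword in element_keywords.items()
        if keyword.lower() in text
    ]
```

A paragraph may match several elements, in which case each matching element is handled with its own prompter template.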

In step S303, a target prompter is generated based on the target paragraph and the prompter template corresponding to the target element.

Specifically, the target paragraph and the prompter template corresponding to the target element may be spliced together; for example, the prompter template is spliced to the head or the tail of the target paragraph, and the splicing result is taken as the target prompter. Alternatively, when the prompter template corresponding to the target element includes a slot marker for characterizing a paragraph, the paragraph content of the target paragraph can be written into the position of the slot marker in the prompter template to obtain the target prompter.
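Both ways of generating the target prompter in step S303 can be sketched in one helper. The function name, the `position` parameter, the newline separator, and the literal slot string are assumptions made for illustration.

```python
SLOT = "[corresponding paragraph]"  # slot marker literal (assumed)


def build_prompter(template, paragraph, slot=SLOT, position="head"):
    """Generate a target prompter from a prompter template and a target paragraph.

    If the template contains the slot marker, the paragraph content is written
    into the slot. Otherwise the template is spliced to the head or the tail of
    the paragraph, depending on `position`.
    """
    if slot in template:
        return template.replace(slot, paragraph)
    if position == "head":
        return f"{template}\n{paragraph}"
    if position == "tail":
        return f"{paragraph}\n{template}"
    raise ValueError("position must be 'head' or 'tail'")
```

Slot filling keeps the instruction and the paragraph in a fixed, designer-chosen order, whereas splicing needs only the template text and the paragraph.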

Taking the case where a single prompter template includes an example, an instruction, and a slot marker for characterizing a paragraph, assume that the paragraph content of the target paragraph is: "Article 6 Disclaimer. You understand and agree that Bank-Home-Pass is not responsible in the following cases (including but not limited to): 1. Bank-Home-Pass does not provide any form of guarantee for the service, including but not limited to that the service will meet your needs, or that the service will be uninterrupted, timely, or free from error. 2. The content and quality of the services provided by the partner organizations of Bank-Home-Pass are the responsibility of those partner organizations. 3. Bank-Home-Pass does not guarantee the accuracy and completeness of the external links provided for the convenience of members, and assumes no responsibility for the content of any web page to which an external link points that is not actually controlled by Bank-Home-Pass. 4. Bank-Home-Pass is not responsible for service changes, interruptions, or terminations arising under Article 4 of this agreement. 5. Bank-Home-Pass assumes no responsibility for losses caused when a member operation instruction you submitted is not correctly executed because of the following conditions: (1) the instruction information is unclear, garbled, or incomplete; (2) the products or services you hold have expired, been terminated, and the like; (3) other cases involving Bank-Home-Pass. 6. Losses caused by computer viruses, trojans or other malicious programs, or hacker attacks. 7. You must use your account safely and properly keep your Bank-Home-Pass user name and password; Bank-Home-Pass is not responsible for losses caused by improper safekeeping." The target element corresponding to the target paragraph is "whether there is an unreasonable disclaimer", and the target element corresponds to the prompter template P described above. By writing the paragraph content of the target paragraph into the position of the slot marker "[corresponding paragraph]" in the prompter template P, the target prompter shown in fig. 4 can be obtained. Fig. 4 is a schematic diagram of a target prompter in the embodiment of the present specification.

In step S305, the target prompter is input into the large language model M, so that the large language model M makes inferences about the target element, and outputs the inference result.

Where a single prompter template includes an instruction as previously described, the inference result output by the large language model M may include an answer A1 selected from the alternative answers included in the instruction in the target prompter. Further, the inference result may also include the content in the paragraph content of the target paragraph that serves as the basis for reasoning about answer A1.
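How answer A1 and the reasoning basis might be separated from raw model output can be sketched as follows. The source does not specify the output format of the large language model M, so this sketch assumes that the output begins with one of the alternative answers and that any remaining text is the reasoning basis; the function name is hypothetical.

```python
def parse_inference(output, alternatives):
    """Split model output into (answer, basis).

    Longer alternatives are tried first so that "do not include" is not
    mistaken for "include". Returns (None, output) if no alternative matches.
    """
    text = output.strip()
    for answer in sorted(alternatives, key=len, reverse=True):
        if text.lower().startswith(answer.lower()):
            # Whatever follows the answer is treated as the reasoning basis.
            basis = text[len(answer):].lstrip(" .:;,\n")
            return answer, basis
    return None, text
```

Returning `None` for unrecognized output lets the caller route such cases to manual review instead of guessing an answer.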

Taking the target prompter shown in fig. 4 as an example, after the target prompter is input into the large language model M, the large language model M can draw on the example in the target prompter ("Common unreasonable disclaimer phrases are: the platform does not bear any responsibility, the user independently bears any responsibility, and the like.") and execute the instruction in the target prompter ("Based on this, please determine whether there is an unreasonable disclaimer in the following paragraph, and answer only 'include' or 'do not include'."). It thereby reasons about whether an unreasonable disclaimer exists in the paragraph content of the target paragraph and outputs the inference result. The inference result includes an answer A1 selected from the alternative answers "include" and "do not include"; that is, answer A1 is either "include" or "do not include". Assuming answer A1 is "include", the content in the paragraph content of the target paragraph that serves as the basis for reasoning about answer A1 may include, for example: "3. Bank-Home-Pass does not guarantee the accuracy and completeness of the external links provided for the convenience of members, and assumes no responsibility for the content of any web page to which an external link points that is not actually controlled by Bank-Home-Pass. 5. Bank-Home-Pass assumes no responsibility for losses caused when a member operation instruction you submitted is not correctly executed."

According to the scheme provided by the embodiment corresponding to fig. 3, the target element corresponding to the target paragraph in the target protocol text can be determined from a plurality of preset elements; any element of the plurality of elements is a question related to protocol text, and a prompter template corresponding to the element is preset. Then, a target prompter can be generated based on the target paragraph and the prompter template corresponding to the target element, and the target prompter is input into the large language model, so that the large language model performs reasoning related to the target element and outputs a reasoning result. In this way, the prompter that is input into the large language model is generated from the prompter template corresponding to the element, and the generalization capability of the large language model is used to analyze protocol text automatically in a few-sample/zero-sample scenario, thereby significantly reducing the analysis cost of the protocol text.

In one embodiment, in order to ensure the reliability of the inference results of the large language model M, after obtaining the inference results output by the large language model M, a manual review may be performed on the inference results. For example, the target prompter and the inference results may be provided to a results auditor, such that the results auditor audits the correctness of the answer A1 in the inference results. Based on this, the application scenario shown in fig. 1 may further include a results auditing device 102 used by a results auditor, and the protocol text detection engine 101 may communicate with the results auditing device 102.

In addition, in order to improve the reasoning accuracy of the large language model M, when the answer A1 does not pass the manual audit, information feedback may be performed to the prompter designer to prompt the prompter designer that the example in the prompter template corresponding to the target element is an error example, so that the prompter designer optimizes the prompter template. Based on this, the application scenario shown in fig. 1 may further include a prompter design apparatus 103 used by a prompter designer, and the protocol text detection engine 101 may communicate with the prompter design apparatus 103.

Next, in conjunction with fig. 5, a manual audit and feedback mechanism in an embodiment of the present description is described. FIG. 5 is a schematic diagram of a manual audit and feedback mechanism according to an embodiment of the present disclosure.

As shown in fig. 5, in step S501, the protocol text detection engine 101 transmits a target prompter and an inference result to the result auditing apparatus 102, so that the result auditing person audits the correctness of the answer A1 in the inference result.

It should be noted that the result auditing device 102 may display the received target prompter and inference result to the result auditor, so that the result auditor can audit the correctness of the answer A1 in the inference result based on the target prompter. In one example, to facilitate manual review, the inference result may also include the content that serves as the basis for reasoning about answer A1, and the result auditor may audit the correctness of answer A1 based on the target prompter and that content. After auditing the correctness of the answer A1, the result auditor may submit an audit result to the result auditing device 102. The audit result is either that the audit is passed or that the audit is not passed. After that, the result auditing device 102 may execute step S503.

In one embodiment, when the audit result is that the audit is not passed, the result auditor can correct the wrong answer A1 to a correct answer A2 and include the answer A2 in the audit result. It will be appreciated that answer A2 is selected by the result auditor from the alternative answers included in the instruction in the target prompter.

In step S503, the result auditing device 102 transmits to the protocol text detection engine 101 the audit result that the result auditor submitted after auditing the correctness of the answer A1 in the inference result.

In step S505, the protocol text detection engine 101 determines whether the auditing result is auditing pass.

As an example, a flag indicating that the audit is passed or not passed may be included in the audit result, and the protocol text detection engine 101 may determine whether the audit result is audit passed by identifying the flag. When the auditing result is that the auditing is not passed, the protocol text detection engine 101 may execute step S507. In addition, regardless of whether the audit result is audit passed or audit not passed, the protocol text detection engine 101 may execute step S509.

In step S507, in response to the audit result being that the audit is not passed, the protocol text detection engine 101 transmits feedback information to the prompter design apparatus 103, the feedback information indicating that the example in the prompter template corresponding to the target element is an erroneous example, so that the prompter designer optimizes the prompter template.

It should be noted that, after receiving the feedback information, the prompter design apparatus 103 may present the feedback information to the prompter designer, so that the prompter designer optimizes the prompter template corresponding to the target element, for example, optimizes an example in the prompter template. Taking the case where the target element corresponds to the prompter template T described above as an example, the prompter designer may, for example, modify the content that follows "common unreasonable disclaimer expressions:" in the prompter template T to improve the inference accuracy of the large language model M.

In step S509, the protocol text detection engine 101 determines a target answer based on the result of the audit; when the auditing result is that the auditing is passed, the target answer is an answer A1; when the auditing result is that the auditing is not passed and the answer A2 provided by the result auditor is included, the target answer is the answer A2.

In step S511, the protocol text detection engine 101 determines a label corresponding to the target answer as a target label corresponding to the target element.

According to the foregoing description, each alternative answer included in the instruction in the prompter template may correspond to a label. After determining the target answer, the protocol text detection engine 101 may map the target answer to its corresponding label, and determine that label as the target label corresponding to the target element.
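Steps S505 to S511 can be illustrated with a minimal Python sketch. The names `AuditResult`, `determine_target_label`, and the answer-to-label mapping below are hypothetical and are introduced only for illustration; the description does not prescribe a data format. The sketch also assumes that no target label is produced when the audit is not passed and no answer A2 is supplied, a case the description leaves open.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AuditResult:
    """Hypothetical container for the auditing result."""
    passed: bool                            # flag identified in step S505
    corrected_answer: Optional[str] = None  # answer A2, if the auditor supplied one


def determine_target_label(answer_a1: str,
                           audit: AuditResult,
                           answer_to_label: Dict[str, str]) -> Optional[str]:
    """Pick the target answer (S509) and map it to its label (S511)."""
    if audit.passed:
        target_answer = answer_a1
    elif audit.corrected_answer is not None:
        target_answer = audit.corrected_answer
    else:
        # Assumption: without a corrected answer there is no target answer.
        return None
    return answer_to_label[target_answer]


# Hypothetical alternative answers "A" and "B" with their labels.
labels = {"A": "yes", "B": "no"}
print(determine_target_label("A", AuditResult(passed=True), labels))
print(determine_target_label("A", AuditResult(False, "B"), labels))
```

In this sketch the mapping from alternative answers to labels is a plain dictionary, which keeps the label lookup independent of how the prompter template words each alternative answer.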

The scheme provided by the corresponding embodiment of fig. 5 can ensure the reliability of the reasoning result of the large-scale language model M and improve the reasoning accuracy of the large-scale language model M through a manual auditing and feedback mechanism.

In one embodiment, the target protocol text may be used to regulate the operation behavior of user A on private data. The user A may be, for example, an enterprise user. The operation behavior may include, for example, any of the following: collection behavior, transmission behavior, storage behavior, sharing behavior, etc. A downstream engine (e.g., a decision engine) of the protocol text detection engine 101 may be configured to detect operation behavior data generated based on the operation behavior of the user A. The target elements and their corresponding target labels, which the protocol text detection engine 101 determines by detecting the target protocol text, may be used by the downstream engine for reference when detecting the operation behavior data.

Based on this, the application scenario shown in fig. 1 may further include a decision engine 104, and the protocol text detection engine 101 may provide each target element and its corresponding target label to the decision engine 104 for reference by the decision engine 104.

In one embodiment, the protocol text detection engine 101 may also store several rules related to the target protocol text. A single rule may include an obligation condition related to the above-described operation behavior, and an obligation requirement containing an element related to the protocol text. This element is one of the plurality of elements mentioned in step S301 as described above. In one example, when the plurality of elements is a plurality of compliance elements, the several rules may be referred to as several compliance rules, which may be provided to the decision engine 104 for compliance detection of operation behavior data. The several rules may be designed by industry professionals in accordance with laws and related regulations. In the case where the plurality of elements are configured for a certain scene among the privacy data usage scenes described above, the several rules may also be configured for that scene.

In practice, when an enterprise wants to collect personal privacy data, the related protocol needs to state requirements on, for example, the data type of the privacy data to be collected by the enterprise. Based on this, when the above-mentioned several rules are several compliance rules configured for the private data collection scene, the operation behavior of the user A on the private data may be the collection behavior, and the several compliance rules may include a compliance rule R1, a compliance rule R2, and the like. The obligation condition in the compliance rule R1 may include, for example, that the data type is a personal name, and the obligation requirement may include, for example, the compliance element "whether collection of a personal name is declared" and a label (e.g., "yes") corresponding to the compliance element. The obligation condition in the compliance rule R2 may include, for example, that the data type is a personal identification, and the obligation requirement may include, for example, the compliance element "whether collection of a personal identification is declared" and a label (e.g., "yes") corresponding to the compliance element.

Next, a detection process of the operation behavior data will be described with reference to fig. 6. Fig. 6 is a schematic diagram of a detection process of operation behavior data in the embodiment of the present disclosure.

As shown in fig. 6, in step S601, the protocol text detection engine 101 transmits, to the decision engine 104, a plurality of rules related to the target protocol text, and each target element determined by detecting the target protocol text and a target label to which each target element corresponds.

In step S603, for a rule of the plurality of rules, if operation behavior data generated based on operation behavior of the user a on the privacy data satisfies an obligation condition in the rule, a target element included in an obligation requirement of the rule is determined in each target element, whether the operation behavior data satisfies the obligation requirement is determined based on a target tag corresponding to the target element, and a detection result is generated based on the determination result, the detection result being used to indicate whether the operation behavior data passes the rule.

It should be noted that, when matching the operation behavior data against any one of the several rules described above, if the operation behavior data does not satisfy the obligation condition in that rule, there is no need to continue matching the operation behavior data against the obligation requirement in that rule. If the operation behavior data satisfies the obligation condition in that rule, the operation behavior data must then be matched against the obligation requirement in that rule.

Taking the compliance rule R1 described above as an example, assume that operation behavior data, which includes a field name characterizing a personal name, a field value corresponding to the field name, and the like, is generated based on the collection of a personal name by the user A. By matching the operation behavior data against the obligation condition in the compliance rule R1, it can be determined that the operation behavior data satisfies the obligation condition. Then, the target element contained in the obligation requirement of the compliance rule R1, for example, "whether collection of a personal name is declared", may be determined among the target elements. Next, whether the operation behavior data satisfies the obligation requirement may be judged based on the target label corresponding to that target element, and the detection result may be generated based on the judgment result. It may be appreciated that when the target label corresponding to the target element is "yes", it may be determined that the operation behavior data satisfies the obligation requirement, and a detection result indicating that the operation behavior data passes the compliance rule R1 is generated; this detection result may indicate that collection of the personal name is declared in the target protocol text and that the collection behavior of the user A on the personal name is compliant. When the target label corresponding to the target element is "no", it may be determined that the operation behavior data does not satisfy the obligation requirement, and a detection result indicating that the operation behavior data does not pass the compliance rule R1 is generated; this detection result may indicate that collection of the personal name is not declared in the target protocol text and that the collection behavior of the user A on the personal name is not compliant.
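The matching logic of step S603 can be sketched in Python as follows. This is an illustrative sketch only: the `Rule` class, the `check_rule` function, and the `data_type` key used to represent operation behavior data are hypothetical names, not part of the described scheme. The sketch further assumes that a rule whose element was not determined as a target element is treated as not passed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Rule:
    """Hypothetical representation of a single rule."""
    name: str
    condition: Callable[[Dict[str, Any]], bool]  # obligation condition
    element: str                                 # element in the obligation requirement
    required_label: str                          # label in the obligation requirement


def check_rule(rule: Rule,
               behavior: Dict[str, Any],
               target_labels: Dict[str, str]) -> Optional[bool]:
    """Return None if the rule does not apply, else whether the data passes it."""
    if not rule.condition(behavior):
        # Obligation condition not satisfied: skip the obligation requirement.
        return None
    # Assumption: a missing target element counts as not satisfying the requirement.
    return target_labels.get(rule.element) == rule.required_label


ELEMENT = "whether collection of a personal name is declared"
R1 = Rule("R1", lambda b: b.get("data_type") == "personal_name", ELEMENT, "yes")

behavior = {"data_type": "personal_name", "field_name": "name"}
print(check_rule(R1, behavior, {ELEMENT: "yes"}))  # passes rule R1
print(check_rule(R1, behavior, {ELEMENT: "no"}))   # does not pass rule R1
```

Returning `None` for a rule whose obligation condition is not satisfied keeps "not applicable" distinct from "not passed", which matches the note above that such rules need no further matching.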

In step S605, the decision engine 104 generates a detection report based on the generated respective detection results.

The detection report may include the respective detection results. The detection report may also include other information, for example, the number of rules that the operation behavior data passed, the number of rules that it did not pass, and the detection time.
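As a rough illustration of step S605, the following Python sketch assembles a report from a list of detection results. The function name `build_report`, the dictionary layout of each detection result, and the report keys are assumptions made for this example; the description does not define a report format.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_report(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize detection results; each result is assumed to carry a 'passed' flag."""
    passed = sum(1 for r in results if r["passed"])
    return {
        "results": list(results),            # the respective detection results
        "rules_passed": passed,              # number of rules passed
        "rules_failed": len(results) - passed,
        # Detection time recorded in UTC, ISO 8601 format.
        "detection_time": datetime.now(timezone.utc).isoformat(),
    }


report = build_report([
    {"rule": "R1", "passed": True},
    {"rule": "R2", "passed": False},
])
print(report["rules_passed"], report["rules_failed"])
```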

In the solution provided in the embodiment corresponding to fig. 6, the target label corresponding to a target element is the label corresponding to a target answer. Because the target answer is determined based on the manual auditing result, it has high reliability. Providing the decision engine with the several rules related to the target protocol text, the target elements determined by detecting the target protocol text, and the target label corresponding to each target element enables automatic detection of the operation behavior data, which can effectively improve detection efficiency and the accuracy of the detection results.

According to the above description, the embodiments of this specification provide a scheme for automatic analysis of protocol text based on a large language model, which can solve the problem of automatic protocol analysis in few-shot/zero-shot scenarios by exploiting the generalization capability of large language models. A pre-trained large language model (such as GPT-3.5) has already performed unsupervised learning on massive amounts of text, and can complete automatic analysis tasks for protocol texts of different types and fields simply by adjusting the prompter information, so that key information and logical relations are extracted from the protocol text and converted into structured data for reasoning and judgment by a downstream decision engine. In addition, by adding a manual auditing and feedback mechanism, the reliability of the reasoning result can be ensured and the reasoning accuracy of the model can be continuously improved. The scheme uses elements (such as compliance elements) and rules (such as compliance rules) to define the decision logic, which increases configurability and extensibility.

Fig. 7 is a schematic structural diagram of a large language model-based protocol text detection apparatus in the embodiment of the present specification. The apparatus may be applied, for example, to the protocol text detection engine 101 as described above. The apparatus may perform the methods as shown in fig. 2, 3, 5, 6, respectively. The apparatus may include: a determining unit 701 configured to determine a target element corresponding to a target paragraph in the target protocol text from a plurality of preset elements; any element of the plurality of elements is a question related to a protocol text and is preset with a prompter template corresponding to the element; a generating unit 702 configured to generate a target prompter based on the target paragraph and a prompter template corresponding to the target element; an inference unit 703 configured to input the target prompter into the large language model, cause the large language model to perform inference about the target element, and output an inference result.
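The cooperation of the determining unit 701, the generating unit 702, and the inference unit 703 can be sketched in Python under several assumptions. Element matching uses keywords and the prompter is built by writing the paragraph into a slot mark, as in some embodiments described earlier. The `[PARAGRAPH]` slot mark, the template wording, the `ELEMENTS` table, and the `model` callable that stands in for the large language model are all hypothetical; no real model API is implied.

```python
from typing import Callable, Dict, List

SLOT = "[PARAGRAPH]"  # hypothetical slot mark characterizing a paragraph

# Hypothetical element configuration: keywords plus a prompter template.
ELEMENTS: Dict[str, Dict] = {
    "whether collection of a personal name is declared": {
        "keywords": ["name"],
        "template": (
            "Paragraph: " + SLOT + "\n"
            "Question: does the paragraph declare collection of a personal name?\n"
            "Options: A. Yes  B. No"
        ),
    },
}


def determine_target_elements(paragraph: str, elements: Dict[str, Dict]) -> List[str]:
    """Determining unit: keep elements whose keyword occurs in the paragraph."""
    return [name for name, cfg in elements.items()
            if any(keyword in paragraph for keyword in cfg["keywords"])]


def generate_prompter(paragraph: str, template: str) -> str:
    """Generating unit: write the paragraph content at the slot mark."""
    return template.replace(SLOT, paragraph)


def infer(paragraph: str, elements: Dict[str, Dict],
          model: Callable[[str], str]) -> Dict[str, str]:
    """Inference unit: query the model once per target element."""
    return {name: model(generate_prompter(paragraph, elements[name]["template"]))
            for name in determine_target_elements(paragraph, elements)}


# A stub replaces the large language model so the sketch runs offline.
print(infer("We collect your name and phone number.", ELEMENTS, lambda prompt: "A"))
```

Passing the model as a callable keeps the sketch independent of any particular model service; in a real deployment the stub would be replaced by a call to the chosen large language model.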

The present embodiments also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the methods as shown in fig. 2, 3, 5, 6, respectively.

Embodiments of the present disclosure also provide a computing device including a memory and a processor, where the memory stores executable code that when executed by the processor implements the methods shown in fig. 2, 3, 5, and 6, respectively.

The present description also provides a computer program product, wherein the computer program product, when executed in a computer, causes the computer to perform the method as shown in fig. 2, 3, 5, 6, respectively.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained by merely programming the method flow logically in one of the hardware description languages described above and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functionality in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even, the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation device is a server system. Of course, the present application does not exclude that as future computer technology evolves, the computer implementing the functions of the above-described embodiments may be, for example, a personal computer, a laptop computer, a car-mounted human-computer interaction device, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Although one or more embodiments of the present description provide method operational steps as described in the embodiments or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one of many possible orders of execution and does not represent the only order of execution. When implemented in an actual device or end product, the steps may be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment, or even in a distributed data processing environment) according to the methods illustrated in the embodiments or the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, it is not excluded that additional identical or equivalent elements may be present in a process, method, article, or apparatus that comprises a described element. Words such as first and second, if used, indicate names and do not denote any particular order.

For convenience of description, the above devices are described as being functionally divided into various modules. Of course, when one or more embodiments of the present description are implemented, the functions of each module may be implemented in the same piece or pieces of software and/or hardware, or a module that implements one function may be implemented by a combination of a plurality of sub-modules or sub-units, or the like. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage, graphene storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

One skilled in the relevant art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to across embodiments, and each embodiment mainly describes its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, see the corresponding description of the method embodiments. In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present specification. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided they do not contradict each other.

The foregoing is merely an example of one or more embodiments of the present specification and is not intended to limit the one or more embodiments of the present specification. Various modifications and alterations to one or more embodiments of this description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of the present specification, should be included in the scope of the claims.

Claims

1. A protocol text detection method based on a large language model comprises the following steps:

determining target elements corresponding to target paragraphs in the target protocol text from a plurality of preset elements; any element of the plurality of elements is a question related to a protocol text and is preset with a prompter template corresponding to the element;

generating a target prompter based on the target paragraph and a prompter template corresponding to the target element;

inputting the target prompter into the large-scale language model, so that the large-scale language model performs reasoning related to the target element, and outputting a reasoning result.

2. The method of claim 1, wherein the plurality of elements is a plurality of compliance elements.

3. The method of claim 1, wherein before determining the target element corresponding to the target paragraph in the target protocol text from the preset plurality of elements, further comprising:

performing segmentation processing on the target protocol text to obtain a target paragraph of the target protocol text.

4. A method according to claim 3, wherein said segmenting said target protocol text comprises:

detecting the digital serial number marks of the target protocol text by using a preset regular expression for detecting various digital serial number marks, and extracting the detected digital serial number marks and the title text corresponding to the digital serial number marks from the target protocol text;

classifying the extracted digital serial number identifiers;

determining the corresponding level of each extracted title text based on the appearance sequence of each type of digital serial number identification in the target protocol text;

and carrying out segmentation processing on the target protocol text based on the levels corresponding to the title texts respectively.

5. The method of claim 1, wherein the plurality of elements correspond to keywords, respectively; and

the determining the target element corresponding to the target paragraph in the target protocol text from the preset multiple elements comprises the following steps:

and for an element in the plurality of elements, if the keyword corresponding to the element is contained in the paragraph content of the target paragraph, determining the element as the target element corresponding to the target paragraph.

6. The method of one of claims 1-5, wherein the prompter template corresponding to any element includes an example, an instruction containing the element and a plurality of alternative answers, and a slot marker for characterizing a paragraph; and

the generating a target prompter based on the target paragraph and the prompter template corresponding to the target element comprises the following steps:

and writing the paragraph content of the target paragraph into the position of the slot mark in the corresponding prompter template of the target element to obtain the target prompter.

7. The method of claim 6, wherein the inference result comprises a first answer selected from among the various alternative answers included by the instructions in the target prompter; and

the method further comprises the steps of:

providing the target prompter and the reasoning result to a first user so that the first user can check the correctness of the first answer;

and receiving the auditing result submitted by the first user.

8. The method of claim 7, wherein the respective corresponding prompter templates of the plurality of elements are configured by a second user in a prompter design apparatus; and

the method further comprises the steps of:

and responding to the auditing result that the auditing is not passed, and sending feedback information for indicating that the example in the corresponding prompter template of the target element is an error example to the prompter design device so as to enable the second user to optimize the prompter template.

9. The method of claim 7, wherein the respective alternative answers correspond to tags, respectively; and

the method further comprises the steps of:

determining a target answer based on the auditing result; when the auditing result is that the auditing is passed, the target answer is the first answer; when the auditing result is that the auditing is not passed and the second answer selected by the first user from the alternative answers is included, the target answer is the second answer;

and determining the label corresponding to the target answer as the target label corresponding to the target element.

10. The method of claim 9, wherein the target protocol text is used to normalize the operational behavior of a third user on private data and associates a number of rules that are preset, a single rule including an obligation condition related to the operational behavior and an obligation requirement containing an element of the plurality of elements; and

the method further comprises the steps of:

inputting each determined target element, a target label corresponding to each target element and the rules into a decision engine, and detecting operation behavior data generated based on the operation behaviors of the third user by the decision engine; the detection process includes:

for a rule in the plurality of rules, if the operation behavior data meets the obligation condition in the rule, determining a target element contained in the obligation requirement of the rule in each target element, judging whether the operation behavior data meets the obligation requirement based on a target label corresponding to the target element, and generating a detection result based on the judgment result, wherein the detection result is used for indicating whether the operation behavior data passes the rule;

and generating a detection report based on each generated detection result.

11. The method of claim 10, wherein the target protocol text is attributed to a certain one of a preset number of privacy data usage scenarios for which the plurality of elements and the number of rules are configured.

12. The method of claim 10, wherein the single privacy data usage scenario is any one of: a private data acquisition scene, a private data transmission scene, a private data storage scene and a private data sharing scene.

13. A large language model based protocol text detection apparatus comprising:

a determining unit configured to determine a target element corresponding to a target paragraph in the target protocol text from a plurality of preset elements; any element of the plurality of elements is a question related to a protocol text and is preset with a prompter template corresponding to the element;

a generating unit configured to generate a target prompter based on the target paragraph and a prompter template corresponding to the target element;

an inference unit configured to input the target prompter into the large language model, cause the large language model to perform inference related to the target element, and output an inference result.

14. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-12.

15. A computing device comprising a memory having executable code stored therein and a processor, which when executing the executable code, implements the method of any of claims 1-12.