CN113282712A - Text screening method, device, medium and equipment - Google Patents

Text screening method, device, medium and equipment

Info

Publication number
CN113282712A
CN113282712A (application CN202110638387.3A)
Authority
CN
China
Prior art keywords
text
similarity
test sample
word
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110638387.3A
Other languages
Chinese (zh)
Inventor
方俊波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202110638387.3A priority Critical patent/CN113282712A/en
Publication of CN113282712A publication Critical patent/CN113282712A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3346 - Query execution using probabilistic model
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a text screening method, apparatus, medium, and device. The method comprises: acquiring at least one text to be screened and a corresponding feature variable set, wherein the feature variable set comprises the feature values of all feature variables set for the enterprise to be screened; performing morpheme analysis on the samples in the feature variable set of the text to be screened and applying word segmentation to obtain a segmented word set; selecting part of the samples from all samples as test samples and calculating the similarity between each test sample and the corresponding word set, wherein empirically set adjustment factors are introduced into the similarity calculation; calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm; and judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.

Description

Text screening method, device, medium and equipment
Technical Field
The present disclosure relates to the field of text screening technologies, and more particularly, to a text screening method, apparatus, medium, and device.
Background
In the current era of vigorous development of artificial intelligence, companies have entered the field of AI-assisted legal services to keep up with the trend. In a contract review model, contract texts are processed by machine: the model identifies which clauses are present in a contract and marks and prompts the missing ones, so that large batches of similar contracts can be screened and the workload of manually reading large numbers of contract documents is reduced.
The prior art takes rule-based (regular expression), semantic, and TF-IDF forms. On the rule-based side, developers read large numbers of similar clauses and compile a set of keywords, and the presence of a clause is judged from those keywords; the drawback of this approach is that keywords compiled by developers are subjective and generalize poorly. On the semantic side, word2vec is mainly used to train word vectors, followed by PCA dimensionality reduction and text similarity calculation, with clause recognition decided by the magnitude of the similarity. On the TF-IDF side, a similarity score is obtained by calculating how important each word is to the sample clause, and the overall accuracy is not high.
Disclosure of Invention
In order to solve the technical problem that the accuracy of prior-art text screening models is not high and cannot meet user requirements, the present disclosure provides a text screening method comprising the following steps:
acquiring at least one text to be screened and a corresponding feature variable set, wherein the feature variable set comprises the feature values of all feature variables set for the enterprise to be screened;
performing morpheme analysis on the samples in the feature variable set of the text to be screened, and applying word segmentation to obtain a segmented word set;
selecting part of the samples from all samples as test samples, and calculating the similarity between each test sample and the corresponding word set, wherein empirically set adjustment factors are introduced into the similarity calculation;
calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
Further, the word segmentation is specifically performed using a conditional random field.
Further, the similarity between the test sample and the word set is specifically calculated with the following formula:
$$R(q_i, d) = \frac{f_i \cdot (k+1)}{f_i + k \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$
where R(qi, d) is the relevance score of the word qi of the word set with respect to the test sample d; fi is the number of occurrences of qi in the test sample d; dl is the length of d; avgdl is the average length of all test samples d; and k and b are adjustment factors, set empirically.
Further, after selecting part of the samples from all samples as test samples and calculating the similarity between the test samples and the word set, the method further comprises:
calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and the step of judging the text screening result according to the similarity between the test sample and the word set then specifically comprises:
judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
Further, the inverse text frequency index is specifically calculated with the following formula:
$$IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
where N is the total number of documents and n(qi) is the number of documents containing the word qi.
Further, judging the text screening result according to the similarity between the test sample and the word set and the inverse text frequency index specifically comprises:
calculating the similarity between the sample Query and the test sample d according to the formula
$$Score(Query, d) = \sum_{i} IDF(q_i) \cdot R(q_i, d)$$
where Score(Query, d) represents the similarity between the sample Query and the test sample d;
and judging the text screening result according to whether the similarity is greater than a preset threshold: if the similarity is greater than the preset threshold, the text meets the requirement; if it is less than the threshold, it does not meet the requirement.
In order to achieve the above technical object, the present disclosure also provides a text screening apparatus, comprising:
a text-to-be-screened acquisition module for acquiring at least one text to be screened and a corresponding feature variable set, wherein the feature variable set comprises the feature values of all feature variables set for the enterprise to be screened;
a morpheme analysis module for performing morpheme analysis on the samples in the feature variable set of the text to be screened and applying word segmentation to obtain a segmented word set;
a calculation module for selecting part of the samples from all samples as test samples and calculating the similarity between each test sample and the corresponding word set, wherein empirically set adjustment factors are introduced into the similarity calculation;
an inverse text frequency index calculation module for calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and a judgment and division module for judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
Further, the word segmentation is specifically performed using a conditional random field.
Further, the similarity between the test sample and the word set is specifically calculated with the following formula:
$$R(q_i, d) = \frac{f_i \cdot (k+1)}{f_i + k \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$
where R(qi, d) is the relevance score of the word qi of the word set with respect to the test sample d; fi is the number of occurrences of qi in the test sample d; dl is the length of d; avgdl is the average length of all test samples d; and k and b are adjustment factors, set empirically.
Further, after selecting part of the samples from all samples as test samples and calculating the similarity between the test samples and the word set, the apparatus further performs:
calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and the step of judging the text screening result according to the similarity between the test sample and the word set then specifically comprises:
judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
Further, the inverse text frequency index is specifically calculated with the following formula:
$$IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
where N is the total number of documents and n(qi) is the number of documents containing the word qi.
Further, judging the text screening result according to the similarity between the test sample and the word set and the inverse text frequency index specifically comprises:
calculating the similarity between the sample Query and the test sample d according to the formula
$$Score(Query, d) = \sum_{i} IDF(q_i) \cdot R(q_i, d)$$
where Score(Query, d) represents the similarity between the sample Query and the test sample d;
and judging the text screening result according to whether the similarity is greater than a preset threshold: if the similarity is greater than the preset threshold, the text meets the requirement; if it is less than the threshold, it does not meet the requirement.
To achieve the above technical objects, the present disclosure also provides a computer storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above text screening method.
In order to achieve the above technical object, the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the text screening method when executing the computer program.
The beneficial effects of the present disclosure are:
1. In terms of the overall model, full-text analysis and calculation are adopted, which solves the problem of weak generalization caused by human intervention; and because no machine-learning word-vector training is adopted, a large number of samples is not needed to compute word vectors, which avoids one-sided vector features when samples are scarce.
2. In terms of algorithm optimization, the addition of the R(qi, d) relevance score and the adjustment factors makes the scoring more accurate and allows some noisy data to be disregarded, so the similarity score is more precise.
3. In terms of similarity score measurement, all standard clauses are included in the calculation, multi-angle analysis is adopted, and weighting is introduced so that strongly representative clauses exert a greater effect, which greatly improves the accuracy of clause judgment.
4. In terms of data sampling, uniformity of the measured, calculated, and predicted data is ensured, avoiding the impact on scoring accuracy of texts that are too long or too short.
Drawings
Fig. 1 shows a schematic flow diagram of embodiment 1 of the present disclosure;
fig. 2 shows a schematic structural diagram of embodiment 2 of the present disclosure;
fig. 3 shows a schematic structural diagram of embodiment 4 of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
Various structural schematics according to embodiments of the present disclosure are shown in the figures. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers, and relative sizes and positional relationships therebetween shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, as actually required.
Embodiment one:
As shown in Fig. 1:
the present disclosure provides a text screening method, including:
S101: acquiring at least one text to be screened and a corresponding feature variable set, wherein the feature variable set comprises the feature values of all feature variables set for the enterprise to be screened;
S102: performing morpheme analysis on the samples in the feature variable set of the text to be screened, and applying word segmentation to obtain a segmented word set;
S103: selecting part of the samples from all samples as test samples, and calculating the similarity between each test sample and the corresponding word set, wherein empirically set adjustment factors are introduced into the similarity calculation;
S104: calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
S105: judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
Further, the word segmentation is specifically performed using a conditional random field.
When performing word segmentation, conventional segmentation techniques can be used, for example segmentation with a conditional random field (CRF):
Conditional random fields (CRFs) are discriminative probabilistic models used for labeling or analyzing sequence data, such as natural language text or biological sequences. A conditional random field is a conditional probability distribution model P(Y|X) of a set of output random variables Y given a set of input random variables X; that is, the CRF assumes that the output random variables form a Markov random field. Conditional random fields can be viewed as a generalization of the maximum-entropy Markov model to labeling problems.
Like a Markov random field, a conditional random field is an undirected graphical model in which the distribution of the random variables Y is a conditional probability given the observed random variables X. In principle, the graph structure of a conditional random field can be arbitrary, but a common structure is a chain, for which efficient algorithms exist for training, inference, and decoding alike. The conditional random field is a typical discriminative model whose joint probability can be written as a product of several potential functions, the most common form being the linear-chain conditional random field.
For example, suppose the acquired registered business name is 平安科技(深圳)有限公司 ("Ping An Technology (Shenzhen) Co., Ltd."). Performing word segmentation on this name yields the words 平安 ("Ping An"), 科技 ("Technology"), 深圳 ("Shenzhen"), 有限 ("Limited"), and 公司 ("Company").
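As an illustration of this segmentation step, the minimal sketch below uses the open-source jieba segmenter as a stand-in; the patent specifies CRF-based segmentation, whereas jieba is dictionary- and HMM-based, so the sketch only shows the input and output shape of the step, not the patented segmenter.

```python
# Minimal segmentation sketch. The patent calls for CRF-based segmentation;
# jieba (dictionary + HMM) is used here only as a readily available stand-in.
import jieba

business_name = "平安科技(深圳)有限公司"  # "Ping An Technology (Shenzhen) Co., Ltd."
words = list(jieba.cut(business_name))
print(words)  # e.g. ['平安', '科技', '(', '深圳', ')', '有限公司']
```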
Further, the similarity between the test sample and the word set is specifically calculated with the following formula (the standard BM25 term-relevance formula, whose variables are defined below):
$$R(q_i, d) = \frac{f_i \cdot (k+1)}{f_i + k \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$
where R(qi, d) is the relevance score of the word qi of the word set with respect to the test sample d; fi is the number of occurrences of qi in the test sample d; dl is the length of d; avgdl is the average length of all test samples d; and k and b are adjustment factors, set empirically.
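A direct Python transcription of this relevance formula might look as follows; the default values of k and b are conventional BM25 settings, chosen here as an assumption since the patent only says they are set empirically.

```python
def bm25_term_relevance(fi: float, dl: float, avgdl: float,
                        k: float = 1.2, b: float = 0.75) -> float:
    """R(qi, d): relevance of query word qi to test sample d.

    fi    -- occurrence count of qi in d
    dl    -- length of d
    avgdl -- average length of all test samples
    k, b  -- empirically set adjustment factors (defaults are common
             BM25 choices, not values specified by the patent)
    """
    K = k * (1 - b + b * dl / avgdl)  # length-normalized saturation term
    return fi * (k + 1) / (fi + K)
```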
Further, after selecting part of the samples from all samples as test samples and calculating the similarity between the test samples and the word set, the method further comprises:
calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and the step of judging the text screening result according to the similarity between the test sample and the word set then specifically comprises:
judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
The inverse text frequency index of each test sample d against each word qi of the word set is calculated using the TF-IDF algorithm.
TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and data mining: TF stands for term frequency and IDF for the inverse document frequency index.
TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use link-analysis-based ranking methods to determine the order in which documents appear in search results.
The TF-IDF model is an information retrieval model widely used in practical applications such as search engines, but various questions have been raised about it. The core idea of the conditional-probability ("box-and-beads", i.e., urn-style) retrieval model for the information retrieval problem is to convert the matching problem between a query string q and a document d into the conditional probability problem that q is generated from d. Compared with the degree of matching expressed by the TF-IDF model, it defines a clearer objective for the information retrieval problem from a probabilistic perspective.
Further, the inverse text frequency index is specifically calculated with the following formula:
$$IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
where N is the total number of documents and n(qi) is the number of documents containing the word qi.
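In code, this index is a one-liner; the 0.5 smoothing terms follow the standard BM25-style IDF, which is the reading of the equation assumed here.

```python
import math

def bm25_idf(N: int, n_qi: int) -> float:
    """Inverse text frequency index of word qi.

    N    -- total number of documents
    n_qi -- number of documents containing qi
    """
    # 0.5 smoothing keeps the value finite for very common or very rare words.
    return math.log((N - n_qi + 0.5) / (n_qi + 0.5))
```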
Further, judging the text screening result according to the similarity between the test sample and the word set and the inverse text frequency index specifically comprises:
calculating the similarity between the sample Query and the test sample d according to the formula
$$Score(Query, d) = \sum_{i} IDF(q_i) \cdot R(q_i, d)$$
where Score(Query, d) represents the similarity between the sample Query and the test sample d;
and judging the text screening result according to whether the similarity is greater than a preset threshold: if the similarity is greater than the preset threshold, the text meets the requirement; if it is less than the threshold, it does not meet the requirement.
The preset threshold is obtained by calculation, and the specific process is as follows:
first, the similarity values of all test samples d are calculated; a suitable value is then selected from them as the calculated threshold, and this threshold is used to judge whether the clause exists.
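Putting the pieces together, the sketch below combines the two helpers above into the full score and a simple threshold rule. The quantile-based threshold selection is an assumption for illustration; the patent only says that a suitable value is chosen from the computed similarity values.

```python
def bm25_score(query_words, doc_words, corpus, k=1.2, b=0.75):
    """Score(Query, d) = sum over qi of IDF(qi) * R(qi, d).

    query_words -- segmented word set of the sample clause (Query)
    doc_words   -- segmented word list of the test sample d
    corpus      -- list of word lists, one per test sample
                   (supplies N, n(qi), and avgdl)
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    dl = len(doc_words)
    score = 0.0
    for qi in set(query_words):
        fi = doc_words.count(qi)
        n_qi = sum(1 for d in corpus if qi in d)
        score += bm25_idf(N, n_qi) * bm25_term_relevance(fi, dl, avgdl, k, b)
    return score

def pick_threshold(scores, quantile=0.5):
    """Hypothetical threshold rule: take a quantile of the score
    distribution over all test samples (the patent does not fix the rule)."""
    ordered = sorted(scores)
    return ordered[int(quantile * (len(ordered) - 1))]
```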
Embodiments of the present disclosure may set different weights for different scoring items.
Furthermore, the scoring items and the weights can be adjusted according to the industry of the enterprise. For example, for the mining industry, the application uses registered capital, number of branches, number of tenders, number of bids, number of external investments, listing type, number of shareholders, number of legal actions, number of business-abnormality records, and number of senior executives as the scoring items for enterprises in that industry, and increases the weights of registered capital and the number of legal actions among those scoring items. For the Internet industry, the application uses registered capital, number of branches, number of tenders, number of bids, number of external investments, listing type, number of shareholders, number of patents, number of copyrights, number of trademarks, number of legal actions, number of business-abnormality records, and number of senior executives as the scoring items, and increases the weights of the numbers of patents, copyrights, and trademarks.
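One straightforward way to represent such industry-specific scoring items is a plain configuration mapping, as in the sketch below; the weight values (and the exact item set) are invented for illustration and are not taken from the patent.

```python
# Illustrative only: item names mirror the examples in the text,
# weight values are assumptions for demonstration.
INDUSTRY_WEIGHTS = {
    "mining": {
        "registered_capital": 2.0,   # up-weighted per the text
        "legal_actions": 2.0,        # up-weighted per the text
        "branches": 1.0,
        "external_investments": 1.0,
        "executives": 1.0,
    },
    "internet": {
        "patents": 2.0,              # up-weighted per the text
        "copyrights": 2.0,
        "trademarks": 2.0,
        "registered_capital": 1.0,
        "legal_actions": 1.0,
    },
}

def weighted_score(item_scores: dict, industry: str) -> float:
    """Weighted average of the per-item scores for one enterprise."""
    weights = INDUSTRY_WEIGHTS[industry]
    used = {k: s for k, s in item_scores.items() if k in weights}
    total_w = sum(weights[k] for k in used)
    return sum(weights[k] * s for k, s in used.items()) / total_w if total_w else 0.0
```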
The innovations of the technical solution of the present disclosure fall into three points in terms of data and logic:
First, at the sample-combing level, per-clause stop-word lists are constructed: before the similarity is calculated, different stop words are added for different clauses so that positioning is more accurate, replacing the traditional practice of using the same set of stop words for all samples. The specific flow is as follows (a sketch follows the list):
screen all words appearing under the same label and count their occurrences;
count the number of times each word appears in the other clauses;
compute the stop words for each individual clause (removing words that appear very rarely in the clause but frequently in the other clauses);
write the stop words for each clause into the model.
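The sketch below implements this flow under stated assumptions: the thresholds min_own and other_ratio, which decide when a word counts as "rare here but frequent elsewhere", are illustrative, since the patent gives no concrete values.

```python
from collections import Counter

def per_clause_stopwords(clause_words, min_own=2, other_ratio=5.0):
    """clause_words: dict mapping clause label -> list of all words seen
    in samples with that label. A word becomes a stop word for clause c
    when it is rare in c but frequent in the other clauses."""
    counts = {c: Counter(ws) for c, ws in clause_words.items()}
    stopwords = {}
    for c, own in counts.items():
        other = Counter()
        for c2, cnt in counts.items():
            if c2 != c:
                other.update(cnt)
        stopwords[c] = {
            w for w in other
            if own.get(w, 0) < min_own
            and other[w] > other_ratio * max(own.get(w, 0), 1)
        }
    return stopwords
```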
In the practical use of machine-learning text classification algorithms (e.g., LDA, Bayes, k-means), preprocessing of the documents turns out to be very important: if too many noise words are included, the effect of the algorithm is often greatly reduced. Within preprocessing, stop-word filtering is a critical step, but it cannot yet be fully automated in a single pass, and the stop words may differ between fields. Following some general rules, supplemented by manual intervention, should nevertheless achieve good results.
Second, at the similarity-score calculation level, the similarity score of each test clause against every sample clause is calculated, each score is weighted, and an average is finally taken. Each sample-clause similarity score corresponds to one feature angle: the test clause is analyzed against the sample clauses from multiple angles, strongly representative clauses receive increased weight, weakly representative ones receive decreased weight, and all scores are normalized at the end.
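Under that reading, the multi-angle scoring reduces to a weighted, normalized average over the sample clauses; the sketch below reuses bm25_score from above, and normalization by the weight sum is one plausible interpretation of the text, not a rule the patent states.

```python
def multi_angle_similarity(test_words, sample_clauses, weights, corpus):
    """Weighted, normalized average of BM25 scores of one test clause
    against every standard sample clause (one score per 'feature angle')."""
    total, total_w = 0.0, 0.0
    for sample_words, w in zip(sample_clauses, weights):
        total += w * bm25_score(sample_words, test_words, corpus)
        total_w += w
    return total / total_w if total_w else 0.0
```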
Third, at the data-sampling level, the logic of taking one passage of text per section is abandoned; instead, the average character length of the clauses is calculated for each sample clause, and the texts used for training and prediction are kept within this average-length range.
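A minimal sketch of this length control, assuming simple truncation to the average length (the patent does not specify the exact rule):

```python
def clip_to_average_length(text, sample_texts):
    """Keep a training/prediction text within the average character
    length of the sample clauses; simple truncation is assumed here."""
    avg_len = int(sum(len(t) for t in sample_texts) / len(sample_texts))
    return text[:avg_len]
```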
Embodiment two:
As shown in Fig. 2:
the present disclosure can also provide a text screening apparatus, including:
a text-to-be-screened acquisition module 201 for acquiring at least one text to be screened and a corresponding feature variable set, wherein the feature variable set comprises the feature values of all feature variables set for the enterprise to be screened;
a morpheme analysis module 202 for performing morpheme analysis on the samples in the feature variable set of the text to be screened and applying word segmentation to obtain a segmented word set;
a calculation module 203 for selecting part of the samples from all samples as test samples and calculating the similarity between each test sample and the corresponding word set, wherein empirically set adjustment factors are introduced into the similarity calculation;
an inverse text frequency index calculation module 204 for calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and a judgment and division module 205 for judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
The text-to-be-screened acquisition module 201 is connected in sequence to the morpheme analysis module 202, the calculation module 203, the inverse text frequency index calculation module 204, and the judgment and division module 205.
Further, the word segmentation is specifically performed using a conditional random field.
Further, the similarity between the test sample and the word set is specifically calculated with the following formula:
$$R(q_i, d) = \frac{f_i \cdot (k+1)}{f_i + k \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$
where R(qi, d) is the relevance score of the word qi of the word set with respect to the test sample d; fi is the number of occurrences of qi in the test sample d; dl is the length of d; avgdl is the average length of all test samples d; and k and b are adjustment factors, set empirically.
Further, after selecting part of the samples from all samples as test samples and calculating the similarity between the test samples and the word set, the apparatus further performs:
calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and the step of judging the text screening result according to the similarity between the test sample and the word set then specifically comprises:
judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
Further, the inverse text frequency index is specifically calculated with the following formula:
$$IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
where N is the total number of documents and n(qi) is the number of documents containing the word qi.
Further, judging the text screening result according to the similarity between the test sample and the word set and the inverse text frequency index specifically comprises:
calculating the similarity between the sample Query and the test sample d according to the formula
$$Score(Query, d) = \sum_{i} IDF(q_i) \cdot R(q_i, d)$$
where Score(Query, d) represents the similarity between the sample Query and the test sample d;
and judging the text screening result according to whether the similarity is greater than a preset threshold: if the similarity is greater than the preset threshold, the text meets the requirement; if it is less than the threshold, it does not meet the requirement.
Embodiment three:
the present disclosure can also provide a computer storage medium having stored thereon a computer program for implementing the steps of the text screening method described above when executed by a processor.
The computer storage medium of the present disclosure may be implemented using semiconductor memory or magnetic core memory.
Semiconductor memory is mainly used for the storage elements of computers and comes in two types, MOS and bipolar. MOS devices offer high integration and a simple process but lower speed; bipolar devices involve a complex process, high power consumption, and low integration, but high speed. With the introduction of NMOS and CMOS, MOS memory came to dominate semiconductor memory. NMOS is fast (for example, about 45 ns for Intel's 1K-bit SRAM), while CMOS has low power consumption (the access time of a 4K-bit CMOS static memory is about 300 ns). The semiconductor memories described above are all random access memories (RAM), which can be read and written randomly during operation. Semiconductor read-only memory (ROM) can be read randomly but not written during operation and is used to store fixed programs and data; ROM is divided into non-rewritable fuse-type ROM and PROM, and rewritable EPROM.
Magnetic core memory is characterized by low cost and high reliability, with more than 20 years of practical experience. Magnetic core memories were widely used as main memory before the mid-1970s. Their storage capacity can exceed 10 megabits, and access times reach 300 ns at the fastest. A typical international magnetic core memory has a capacity of 4 MB to 8 MB and an access cycle of 1.0 to 1.5 µs. After semiconductor memory developed rapidly and replaced magnetic core memory as main memory, core memory could still be applied as large-capacity expansion memory.
Embodiment four:
the present disclosure also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the text screening method are implemented.
Fig. 3 is a schematic diagram of the internal structure of the electronic device in one embodiment. As shown in Fig. 3, the electronic device includes a processor, a storage medium, a memory, and a network interface connected through a system bus. The storage medium of the device stores an operating system, a database, and computer-readable instructions; the database may store control-information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a text screening method. The processor of the electronic device provides the computing and control capability that supports the operation of the entire device. The memory of the device may store computer-readable instructions that, when executed by the processor, cause the processor to perform the text screening method. The network interface of the device is used to connect and communicate with a terminal. Those skilled in the art will appreciate that the architecture shown in Fig. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the devices to which the disclosed aspects apply; a particular device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The electronic device includes, but is not limited to, a smartphone, a computer, a tablet, a wearable smart device, an artificial intelligence device, a mobile power source, and the like.
In some embodiments, the processor may consist of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor is the control unit of the electronic device: it connects the various components of the electronic device through various interfaces and lines, and executes the functions of the electronic device and processes its data by running or executing the programs or modules stored in the memory (for example, a remote data read/write program) and calling the data stored in the memory.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connected communication between the memory and at least one processor or the like.
Fig. 3 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 3 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may include a display and an input unit such as a keyboard; optionally, the user interface may also include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be called a display screen or display unit, is used to display the information processed in the electronic device and to present a visual user interface.
Further, the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The beneficial effects of the present disclosure are:
1. In terms of the overall model, full-text analysis and calculation are adopted, which solves the problem of weak generalization caused by human intervention; and because no machine-learning word-vector training is adopted, a large number of samples is not needed to compute word vectors, which avoids one-sided vector features when samples are scarce.
2. In terms of algorithm optimization, the addition of the R(qi, d) relevance score and the adjustment factors makes the scoring more accurate and allows some noisy data to be disregarded, so the similarity score is more precise.
3. In terms of stop-word optimization, different stop words are added for different clauses, which avoids a one-size-fits-all treatment, effectively controls stop-word interference between clauses, and makes positioning clearer.
4. In terms of similarity score measurement, all standard clauses are included in the calculation, multi-angle analysis is adopted, and weighting is introduced so that strongly representative clauses exert a greater effect, which greatly improves the accuracy of clause judgment.
5. In terms of data sampling, uniformity of the measured, calculated, and predicted data is ensured, avoiding the impact on scoring accuracy of texts that are too long or too short.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A text screening method, comprising:
acquiring at least one text to be screened and a corresponding feature variable set, wherein the feature variable set comprises the feature values of all feature variables set for the enterprise to be screened;
performing morpheme analysis on the samples in the feature variable set of the text to be screened, and applying word segmentation to obtain a segmented word set;
selecting part of the samples from all samples as test samples, and calculating the similarity between each test sample and the corresponding word set, wherein empirically set adjustment factors are introduced into the similarity calculation;
calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
2. The method of claim 1, wherein the word segmentation process is performed using conditional random fields.
3. The method according to claim 1, wherein the similarity between the test sample and the word set is calculated with the following formula:
$$R(q_i, d) = \frac{f_i \cdot (k+1)}{f_i + k \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$
where R(qi, d) is the relevance score of the word qi of the word set with respect to the test sample d; fi is the number of occurrences of qi in the test sample d; dl is the length of d; avgdl is the average length of all test samples d; and k and b are adjustment factors, set empirically.
4. The method according to claim 1, wherein the inverse text frequency index is calculated with the following formula:
$$IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
where N is the total number of documents and n(qi) is the number of documents containing the word qi.
5. The method according to claim 4, wherein judging the text screening result according to the similarity between the test sample and the word set and the inverse text frequency index specifically comprises:
calculating the similarity between the sample Query and the test sample d according to the formula
$$Score(Query, d) = \sum_{i} IDF(q_i) \cdot R(q_i, d)$$
where Score(Query, d) represents the similarity between the sample Query and the test sample d;
and judging the text screening result according to whether the similarity is greater than a preset threshold: if the similarity is greater than the preset threshold, the text meets the requirement; if it is less than the threshold, it does not meet the requirement.
6. A text screening apparatus, comprising:
a text-to-be-screened acquisition module for acquiring at least one text to be screened and a corresponding feature variable set, wherein the feature variable set comprises the feature values of all feature variables set for the enterprise to be screened;
a morpheme analysis module for performing morpheme analysis on the samples in the feature variable set of the text to be screened and applying word segmentation to obtain a segmented word set;
a calculation module for selecting part of the samples from all samples as test samples and calculating the similarity between each test sample and the corresponding word set, wherein empirically set adjustment factors are introduced into the similarity calculation;
an inverse text frequency index calculation module for calculating the inverse text frequency index of each test sample against the word set using a TF-IDF algorithm;
and a judgment and division module for judging the text screening result according to the similarity between the test sample and the word set together with the inverse text frequency index.
7. The apparatus according to claim 6, wherein the word segmentation process is performed by using conditional random fields.
8. The apparatus of claim 6, wherein the similarity between the test sample and the word set is calculated with the following formula:
$$R(q_i, d) = \frac{f_i \cdot (k+1)}{f_i + k \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$
where R(qi, d) is the relevance score of the word qi of the word set with respect to the test sample d; fi is the number of occurrences of qi in the test sample d; dl is the length of d; avgdl is the average length of all test samples d; and k and b are adjustment factors, set empirically.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method according to any of claims 1 to 5 are performed when the computer program is executed by the processor.
10. A computer storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, are adapted to perform the steps corresponding to the text screening method of any of claims 1 to 5.
CN202110638387.3A 2021-06-08 2021-06-08 Text screening method, device, medium and equipment Pending CN113282712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110638387.3A CN113282712A (en) 2021-06-08 2021-06-08 Text screening method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110638387.3A CN113282712A (en) 2021-06-08 2021-06-08 Text screening method, device, medium and equipment

Publications (1)

Publication Number Publication Date
CN113282712A 2021-08-20

Family

ID=77283870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110638387.3A Pending CN113282712A (en) 2021-06-08 2021-06-08 Text screening method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN113282712A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN110134799A (en) * 2019-05-29 2019-08-16 四川长虹电器股份有限公司 A kind of text corpus based on BM25 algorithm build and optimization method
CN110543595A (en) * 2019-08-12 2019-12-06 南京莱斯信息技术股份有限公司 in-station search system and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN110134799A (en) * 2019-05-29 2019-08-16 四川长虹电器股份有限公司 A kind of text corpus based on BM25 algorithm build and optimization method
CN110543595A (en) * 2019-08-12 2019-12-06 南京莱斯信息技术股份有限公司 in-station search system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LITTLE CODER: "BM25文本相似度算法" ("BM25 text similarity algorithm"), pages 1-3, Retrieved from the Internet <URL:https://blog.csdn.net/qq_20989105/article/details/95460003> *
李楠等 (LI, Nan et al.): "一种新的融合BM25与文本特征的新闻摘要算法" ("A new news summarization algorithm fusing BM25 and text features"), 成都信息工程大学学报 (Journal of Chengdu University of Information Technology), vol. 33, no. 2, 30 April 2018 (2018-04-30), pages 113-118 *

Similar Documents

Publication Publication Date Title
CN112632385B (en) Course recommendation method, course recommendation device, computer equipment and medium
Yang et al. A LSTM based model for personalized context-aware citation recommendation
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
CN109800307B (en) Product evaluation analysis method and device, computer equipment and storage medium
CN111563158B (en) Text ranking method, ranking apparatus, server and computer-readable storage medium
Li et al. Stock prediction via sentimental transfer learning
CN113449187A (en) Product recommendation method, device and equipment based on double portraits and storage medium
CN110688405A (en) Expert recommendation method, device, terminal and medium based on artificial intelligence
CN112906377A (en) Question answering method and device based on entity limitation, electronic equipment and storage medium
CN112101029B (en) Bert model-based university teacher recommendation management method
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data
CN114817683A (en) Information recommendation method and device, computer equipment and storage medium
Reys et al. Predicting multiple ICD-10 codes from Brazilian-Portuguese clinical notes
Cai et al. PURA: a product-and-user oriented approach for requirement analysis from online reviews
CN113449204A (en) Social event classification method and device based on local aggregation graph attention network
CN113886708A (en) Product recommendation method, device, equipment and storage medium based on user information
Shi et al. A Word2vec model for sentiment analysis of weibo
Wang et al. Personalizing label prediction for GitHub issues
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
Gao et al. An attention-based ID-CNNs-CRF model for named entity recognition on clinical electronic medical records
CN113282712A (en) Text screening method, device, medium and equipment
CN112328752B (en) Course recommendation method and device based on search content, computer equipment and medium
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
CN114139530A (en) Synonym extraction method and device, electronic equipment and storage medium
Jędrzejewski et al. Performance of k-nearest neighbors algorithm in opinion classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination