CN114090798A

CN114090798A - Text duplicate removal method and device, computer storage medium and electronic equipment

Info

Publication number: CN114090798A
Application number: CN202111342221.3A
Authority: CN
Inventors: 潘仕江
Original assignee: Yancheng Jindi Technology Co Ltd
Current assignee: Yancheng Tianyanchawei Technology Co ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-02-25

Abstract

An embodiment of the application provides a text deduplication method and device, a computer storage medium, and electronic equipment. The text deduplication method comprises: determining a plurality of texts to be processed that are associated with the same judicial bulletin case number, wherein the texts to be processed comprise feature description data of judicial events; extracting time features describing the judicial events from the feature description data; and determining repeated texts to be processed based on the extracted time features and deduplicating them, thereby deduplicating the texts to be processed from which judicial data can be analyzed.

Description

Text duplicate removal method and device, computer storage medium and electronic equipment

Technical Field

The application relates to the technical field of data processing, in particular to a text duplicate removal method and device, a computer storage medium and electronic equipment.

Background

Based on a big data solution, collected enterprise data undergoes deep mining such as cleaning, analysis, and sorting, so that comprehensive or classified data query services can be provided, for example querying enterprise-related information that includes judicial data. Based on judicial data, an enterprise can avoid risk when subsequently choosing partners, analyze the credit of a partner enterprise to decide whether to collaborate further, and so on. However, because internet data comes from numerous sources, the data from which judicial data can be analyzed contains many duplicates.

Disclosure of Invention

Embodiments of the present application provide a text deduplication method and apparatus, a computer storage medium, and an electronic device, so as to overcome or alleviate the above technical problems in the prior art.

The technical scheme adopted by the application is as follows:

a method of de-duplicating text, comprising:

determining a plurality of texts to be processed which are associated with the same judicial bulletin case number, wherein the texts to be processed comprise characteristic description data of judicial events;

extracting time features describing judicial events from the feature description data;

and determining repeated texts to be processed and carrying out deduplication processing on the repeated texts to be processed based on the extracted time characteristics.

Optionally, in an embodiment, the extracting, from the feature description data, of a time feature describing a judicial event includes: performing regular-expression matching on the feature description data, based on a regular expression for extracting the time feature, so as to extract the time feature describing the judicial event from the feature description data.

Optionally, in an embodiment, the determining and de-duplicating a repeated text to be processed based on the extracted temporal features includes:

determining repeated texts to be processed based on the extracted time characteristics, and adding a first label to the repeated texts to be processed;

and based on the added first label, carrying out duplicate removal processing on the repeated text to be processed.

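Optionally, in an embodiment, the determining of repeated texts to be processed and deduplicating them based on the extracted time features includes:

determining similarity between the time features extracted from different feature description data;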

and in response to the judgment result that the similarity is smaller than the set similarity threshold, judging the corresponding at least two texts to be processed into repeated texts to be processed, and performing deduplication processing on the repeated texts to be processed.

Optionally, in an embodiment, the determining the similarity between the time features extracted from the different feature description data includes: based on the set feature description period, the similarity between the time features extracted from different feature description data in the same feature description period is statistically determined.

Optionally, in an embodiment, the method further includes:

preliminarily judging the text to be processed to be non-repeated according to the extracted time characteristics, and extracting personnel characteristics for describing judicial events from the characteristic description data of the text to be processed to be non-repeated;

and determining the text to be processed which is actually repeated in the non-repeated text to be processed based on the extracted personnel characteristics, and performing deduplication processing on the text to be processed.

Optionally, in an embodiment, determining, based on the extracted person features, an actually repeated text to be processed in the non-repeated text to be processed, and performing deduplication processing on the text to be processed, includes:

determining an actual repeated text to be processed in the non-repeated text to be processed based on the extracted personnel features, and adding a second label to the actual repeated text to be processed;

and based on the added second label, carrying out duplicate removal processing on the actually repeated text to be processed.

Optionally, in an embodiment, the performing, based on the added second label, deduplication processing on the actually repeated text to be processed includes:

and if the second labels of two texts to be processed are the same, judging that the two texts to be processed are actually repeated texts to be processed, and retaining the text to be processed whose evaluation value is larger than a set evaluation threshold, wherein the evaluation value comprises at least one of a timeliness evaluation value, an authority evaluation value, and an information quantity evaluation value.

Optionally, in an embodiment, the extracting, from the corresponding feature description data, a person feature describing a judicial event includes: and performing regular matching on the feature description data of the non-repeated text to be processed based on the regular expression for extracting the personnel features so as to extract the personnel features for describing judicial events.

Optionally, in an embodiment, the determining multiple texts to be processed associated with the same judicial bulletin case number, where the texts to be processed include feature description data of a judicial event, includes:

normalizing the obtained judicial bulletin case number so that its expression conforms to a normalized expression rule;

and traversing the judicial event bulletin library by using the judicial bulletin case number after the normalization processing so as to retrieve a plurality of texts to be processed which are related to the same judicial bulletin case number.

Optionally, in an embodiment, the time feature is a court session time and the person feature is a party to the case; or, the time feature is a party to the case and the person feature is a court session time.

Optionally, in an embodiment, the text to be processed is a text corresponding to a court announcement or a referee document, or a text including feature description data in the court announcement or the referee document.

A device for de-duplicating text, comprising:

a text acquisition unit, configured to determine a plurality of texts to be processed that are associated with the same judicial bulletin case number, wherein the texts to be processed comprise feature description data of judicial events;

the key feature extraction unit is used for extracting the time features describing the judicial events from the feature description data;

and the repeated text determining unit is used for determining repeated texts to be processed and carrying out deduplication processing on the repeated texts to be processed based on the extracted time characteristics.

A computer storage medium having stored thereon a computer executable program, the computer executable program being operative to perform a method as in any one of the embodiments of the present application.

An electronic device comprising a memory for storing thereon a computer-executable program and a processor for executing the computer-executable program to implement the method of any of the embodiments of the present application.

In the embodiments of the application, a plurality of texts to be processed that are associated with the same judicial bulletin case number are determined, wherein the texts to be processed comprise feature description data of judicial events; time features describing the judicial events are extracted from the feature description data; and repeated texts to be processed are determined based on the extracted time features and deduplicated, thereby deduplicating the texts to be processed from which judicial data can be analyzed.

Drawings

FIG. 1 is a schematic view of a scenario in which a user uses an application according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a text deduplication method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart illustrating a text deduplication method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a text deduplication apparatus according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

To make the technical problems, technical solutions and advantages to be solved by the present application clearer, the following detailed description is made with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a schematic view of a scenario in which a user uses an application according to an embodiment of the present application; as shown in fig. 1, the application scenario is directed to a data query system, where the data query system includes a terminal 101 and an application server 102, where the application server 102 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as cloud service, cloud database, cloud computing, cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and big data and artificial intelligence platform. The terminal 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 101 and the application server 102 may be directly or indirectly connected through a wireless communication manner (such as a network), and the application is not limited herein.

In order to ensure the response speed and efficiency of the query, the application server 102 is provided with a reference database, an ES database and a detail database, wherein the reference database stores reference ES data and detail data, the ES database stores ES data, the detail database stores detail data corresponding to the ES data, and the ES data stored by the ES database and the detail data stored by the detail database are synchronized from the reference database. When the user uses the application program to be tested installed on the terminal to carry out data query, the retrieved result data is directly from the ES database and the detail database.

Here, the server of the data correctness verifying apparatus is not particularly limited; for example, it may be the same physical server but logically separate, or it may be a different physical server.

The following embodiments produce ES data associated with judicial data and corresponding detail data based on the deduplicated text to be processed and further based on a data production system.

In the embodiments described below, the execution subject of the method may be a data production system, such as a data processing server in the data production system.

In the following embodiments, the text to be processed is a text corresponding to a court announcement or a referee document, or a text that includes feature description data from a court announcement or a referee document. Court announcements and referee documents are issued by judicial authorities. A text that includes feature description data from them is, for example, a third-party text formed by further data mining of court announcements and referee documents, and may also be a related news report or the like. Alternatively, the text to be processed may also be plan information, a delivery notice, or the like.

FIG. 2 is a schematic flowchart of a text deduplication method according to an embodiment of the present application; as shown in fig. 2, it includes:

S201, determining a plurality of texts to be processed related to the same judicial bulletin case number, wherein the texts to be processed comprise feature description data of judicial events;

In this embodiment, if all the texts to be processed have been collected into a to-be-processed text library, the judicial bulletin case number may be used to search that library, so as to determine a plurality of texts to be processed that have the same judicial bulletin case number.
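As a minimal sketch of this retrieval step, the following Python groups an in-memory text library by case number. The record layout (`id`, `case_number`, `body`), the sample case numbers, and the function name `group_by_case_number` are illustrative assumptions and do not come from the application.

```python
from collections import defaultdict

def group_by_case_number(library):
    """Group texts to be processed by their judicial bulletin case number."""
    groups = defaultdict(list)
    for text in library:
        groups[text["case_number"]].append(text)
    # Only a case number shared by several texts can contain duplicates.
    return {number: texts for number, texts in groups.items() if len(texts) > 1}

# Hypothetical library of texts to be processed.
library = [
    {"id": 1, "case_number": "(2021)京01民初123号", "body": "court announcement"},
    {"id": 2, "case_number": "(2021)京01民初123号", "body": "news report"},
    {"id": 3, "case_number": "(2021)沪02民终456号", "body": "court announcement"},
]

groups = group_by_case_number(library)
```

Only the groups returned here would need to go through the later feature extraction and repetition judgment.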

In this embodiment, the feature description data of the judicial event includes any data that can characterize the judicial event, so that the judicial event is different from other non-judicial events.

S202, extracting time characteristics describing judicial events from the characteristic description data;

Optionally, the extracting of the time feature describing the judicial event from the feature description data includes: extracting the time feature describing the judicial event from the feature description data based on an extraction model used to extract the time feature.

In this embodiment, the time characteristic may be partial data in the data representing the judicial event.

Further, the extracting of the time feature describing the judicial event from the feature description data based on the extraction model includes: performing regular-expression matching on the feature description data, based on a regular expression for extracting the time feature, so as to extract the time feature describing the judicial event from the feature description data, wherein the regular expression serves as the extraction model.

Specifically, regular expressions may be built based on the format or representation of the temporal features. Different temporal features configure different regular expressions.
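A regular expression used as the extraction model could look like the sketch below. The application does not specify the time format; the pattern here assumes court session times written as `2021-11-12 09:30` or `2021年11月12日 09:30`, and the names `SESSION_TIME_RE` and `extract_session_time` are hypothetical.

```python
import re
from datetime import datetime

# Assumed formats: "2021-11-12 09:30" or "2021年11月12日 09:30".
SESSION_TIME_RE = re.compile(
    r"(\d{4})[-年](\d{1,2})[-月](\d{1,2})日?\s*(\d{1,2})[:：时](\d{1,2})"
)

def extract_session_time(description):
    """Return the first court session time found in the text, or None."""
    match = SESSION_TIME_RE.search(description)
    if match is None:
        return None
    year, month, day, hour, minute = (int(group) for group in match.groups())
    return datetime(year, month, day, hour, minute)

session_time = extract_session_time("定于2021年11月12日 09:30在第三法庭开庭")
```

A different time feature (for example a filing date without a clock time) would need its own pattern, in line with the statement that different time features configure different regular expressions.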

In other embodiments, the extraction model may also be a neural network model, an expert system model, or a text recognition model, and may be flexibly selected according to the requirements of the application scenario.

S203, determining repeated texts to be processed based on the extracted time characteristics and carrying out deduplication processing on the repeated texts to be processed.

Optionally, in this embodiment, the determining of repeated texts to be processed and deduplicating them based on the extracted time features includes:

determining repeated texts to be processed based on the extracted time features, and adding a first label to the repeated texts to be processed;

and based on the added first label, deduplicating the repeated texts to be processed.

Adding the first label makes the deduplication convenient and fast. Preferably, the first labels added to repeated texts to be processed are the same. Of course, in other embodiments, the first labels added to repeated texts to be processed may also be different; in that case, a mapping relationship between the first labels is established to indicate that the texts to be processed are repeated texts.
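One way to give every group of repeated texts the same first label is sketched below. It assumes the repeated pairs have already been determined and uses a small union-find structure to merge them; the union-find approach, the `dup-` label format, and the function name `assign_first_labels` are assumptions for illustration, not the method described in the application.

```python
def assign_first_labels(text_ids, repeated_pairs):
    """Assign one shared label to each group of mutually repeated texts."""
    parent = {text_id: text_id for text_id in text_ids}

    def find(text_id):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[text_id] != text_id:
            parent[text_id] = parent[parent[text_id]]
            text_id = parent[text_id]
        return text_id

    for a, b in repeated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so labels are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {text_id: f"dup-{find(text_id)}" for text_id in text_ids}

labels = assign_first_labels([1, 2, 3, 4], [(1, 2), (2, 3)])
```

Texts 1, 2, and 3 end up with the same label, while text 4 keeps a label of its own, so a later step can deduplicate by comparing labels only.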

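Specifically, the determining of repeated texts to be processed and deduplicating them based on the extracted time features may include:

determining similarity between the time features extracted from different feature description data;

and in response to a judgment result that the similarity is smaller than a set similarity threshold, judging the corresponding at least two texts to be processed to be repeated texts to be processed, and deduplicating them.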

By the method based on the similarity between the time characteristics, the judgment of the repeated texts to be processed is quickly and simply realized, the data processing efficiency is improved, and the accuracy is ensured.

In combination with the manner of adding the first label, the determining of repeated texts to be processed and deduplicating them based on the extracted time features may include:

determining similarity between the time features extracted from different feature description data;

in response to a judgment result that the similarity is smaller than a set similarity threshold, judging the corresponding at least two texts to be processed to be repeated texts to be processed, and adding a first label to the repeated texts to be processed;

and based on the added first label, deduplicating the repeated texts to be processed.

By the method for adding the first label and based on the similarity, the efficiency of data processing is further improved, and meanwhile, the accuracy is guaranteed.

Optionally, in an embodiment, the performing, based on the added first label, a deduplication process on the repeated text to be processed includes: and if the first labels of the two texts to be processed are the same, judging that the two texts to be processed are repeated texts to be processed, and reserving the texts to be processed with a first evaluation value larger than a set first evaluation threshold value, wherein the first evaluation value comprises at least one of a timeliness evaluation value, an authority evaluation value and an information content evaluation value.

Optionally, in an embodiment, the determining the similarity between the time features extracted from the different feature description data includes: based on the set feature description period, the similarity between the time features extracted from different feature description data in the same feature description period is determined statistically, so that feature data for which the similarity is calculated is reduced, and the data processing efficiency is improved.

Illustratively, in a specific application scenario, the text to be processed is a court announcement text, the time feature is the court session time recorded in the court announcement text, and the similarity is the difference between the court session times recorded in different court announcement texts. That is, when judging repetition, texts to be processed are considered repeated only if their court session times are as close to identical as possible. Because a court may hold several sessions on the same natural day, the feature description period is set to one natural day: the differences between court session times on the same day are computed to represent the similarity of the court session times. The similarity threshold is then a time difference threshold, for example 1 hour. If the difference between two court session times is less than or equal to 1 hour, the two corresponding court announcements can be judged to be repeated.
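The scenario above can be sketched as follows: session times are bucketed by natural day, and only pairs within the same day are compared against the 1-hour time difference threshold. The dictionary layout and the names `TIME_DIFF_THRESHOLD` and `find_repeated_pairs` are assumptions made for this example.

```python
from datetime import datetime, timedelta
from itertools import combinations

TIME_DIFF_THRESHOLD = timedelta(hours=1)  # example threshold from the text

def find_repeated_pairs(session_times):
    """session_times maps a text id to its court session time (datetime)."""
    by_day = {}
    for text_id, when in session_times.items():
        # The feature description period is one natural day.
        by_day.setdefault(when.date(), []).append(text_id)

    pairs = []
    for ids in by_day.values():
        for a, b in combinations(sorted(ids), 2):
            if abs(session_times[a] - session_times[b]) <= TIME_DIFF_THRESHOLD:
                pairs.append((a, b))
    return sorted(pairs)

times = {
    "a": datetime(2021, 11, 12, 9, 0),
    "b": datetime(2021, 11, 12, 9, 30),
    "c": datetime(2021, 11, 12, 14, 0),
    "d": datetime(2021, 11, 13, 9, 0),
}
pairs = find_repeated_pairs(times)
```

Bucketing by day first means that texts from different days are never compared, which is the reduction in similarity calculations that the feature description period is meant to provide.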

In summary, repetition among the texts to be processed is judged through the time features so that deduplication can be performed. During deduplication, for example, only one text to be processed is retained, such as the one with the highest first evaluation value, for example the text to be processed that comes from an authoritative source.
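Retaining the single text with the highest first evaluation value might be sketched as below. The application only says the evaluation value comprises at least one of a timeliness, authority, and information content evaluation value; the weighted sum, the field names, and the sample scores here are assumptions.

```python
def evaluation_value(text, weights=(1.0, 1.0, 1.0)):
    """Combine the three component scores into one value (assumed weighted sum)."""
    w_time, w_authority, w_info = weights
    return (
        w_time * text["timeliness"]
        + w_authority * text["authority"]
        + w_info * text["information"]
    )

def keep_best(repeated_texts):
    """Return the one text to retain from a group of repeated texts."""
    return max(repeated_texts, key=evaluation_value)

# Hypothetical scores for two repeated texts about the same court session.
repeated = [
    {"source": "news", "timeliness": 1.0, "authority": 0.25, "information": 0.5},
    {"source": "court", "timeliness": 0.5, "authority": 1.0, "information": 0.5},
]
kept = keep_best(repeated)
```

With equal weights the text from the court is retained, matching the example of preferring the text from an authoritative source.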

FIG. 3 is a schematic flow chart illustrating a text deduplication method according to an embodiment of the present application. Unlike the foregoing embodiment of FIG. 2, this embodiment further targets texts to be processed that may in substance be repeated even among the non-repeated texts obtained through the embodiment of FIG. 2, and adds a processing step that makes the judgment based on person features. Specifically, as shown in FIG. 3, the method includes:

S301, determining a plurality of texts to be processed related to the same judicial bulletin case number, wherein the texts to be processed comprise feature description data of judicial events;

S302, extracting time features describing judicial events from the feature description data;

S303, determining repeated texts to be processed based on the extracted time features and carrying out deduplication processing on the repeated texts to be processed.

In this embodiment, steps S301 to S303 are similar to steps S201 to S203 described above and are not described again in detail. Of course, in light of the present application, those skilled in the art can also implement steps S201 to S203 in ways different from the above without departing from the spirit of the present application.

S304, for the texts to be processed that are preliminarily judged to be non-repeated according to the extracted time features, extracting person features describing judicial events from the feature description data of those texts;

In this embodiment, as described above, the non-repeated texts to be processed obtained after deduplication based on the time features may still contain texts that are in fact repeated. Therefore, the person features are extracted in step S304 so that actual repetition can be judged further.

In this embodiment, the extracting of the person feature describing the judicial event from the feature description data of the non-repetitive text to be processed includes: and extracting the personnel features describing the judicial events from the feature description data of the non-repeated text to be processed based on an extraction model for extracting the personnel features.

Further, the extracting of the person feature describing the judicial event from the feature description data of the non-repeated text to be processed includes: performing regular-expression matching on the feature description data of the non-repeated text to be processed, based on a regular expression for extracting the person feature, so as to extract the person feature describing the judicial event.

Here, specifically, the regular expression may be established based on the format or expression manner of the person feature. Different personnel features configure different regular expressions.
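A regular expression for person features could be sketched as follows. It assumes parties appear after role words such as 原告 (plaintiff), 被告 (defendant), 上诉人 (appellant), or 被上诉人 (appellee) followed by a colon; this format, and the names `PARTY_RE` and `extract_parties`, are assumptions rather than details given in the application.

```python
import re

# Assumed format: a role word, a colon (full-width or ASCII), then the name,
# which runs until the next separator, full stop, or whitespace.
PARTY_RE = re.compile(r"(原告|被告|上诉人|被上诉人)[:：]\s*([^;；,，。\s]+)")

def extract_parties(description):
    """Return the set of (role, name) pairs found in the description."""
    return set(PARTY_RE.findall(description))

parties = extract_parties("原告：张三；被告：李四有限公司。")
```

Returning a set makes the result independent of the order in which a source lists the parties, which is convenient when the party lists of two texts are compared.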

S305, determining the text to be processed which is actually repeated in the non-repeated text to be processed based on the extracted personnel features, and performing deduplication processing on the text to be processed.

In this embodiment, deduplication based on the time features amounts to a preliminary deduplication, and deduplication based on the person features amounts to a secondary deduplication, which improves deduplication efficiency and removes texts to be processed that may still be repeated.

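In this embodiment, determining the actually repeated texts to be processed among the non-repeated texts to be processed based on the extracted person features, and deduplicating them, includes:

determining the actually repeated texts to be processed among the non-repeated texts to be processed based on the extracted person features, and adding a second label to the actually repeated texts to be processed;

and based on the added second label, deduplicating the actually repeated texts to be processed.

Adding the second label, in a manner similar to adding the first label, allows the deduplication to be realized quickly.

Specifically, the determining of the actually repeated texts to be processed and deduplicating them based on the extracted person features may include: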

determining similarity between the person features extracted from the different feature description data;

and in response to a judgment result that the similarity is smaller than a set similarity threshold, judging the corresponding at least two non-repeated texts to be processed into repeated texts to be processed, and performing de-duplication processing on the repeated texts.

Through the above manner based on the similarity between the personnel characteristics, the judgment of the repeated texts to be processed is quickly and simply realized, the data processing efficiency is improved, and the accuracy is ensured.

In combination with the above manner of adding the second label, the determining of repeated texts to be processed and deduplicating them based on the extracted person features may include:

in response to a judgment result that the similarity is smaller than a set similarity threshold, judging at least two corresponding non-repeated texts to be processed into repeated texts to be processed, and adding a second label to the repeated texts to be processed;

and based on the added second label, carrying out duplicate removal processing on the repeated text to be processed.

Through the mode of adding the second label and based on the similarity, the efficiency of data processing is further improved, and meanwhile, the accuracy is guaranteed.

Optionally, in this embodiment, the performing, based on the added second label, deduplication processing on an actually repeated text to be processed includes:

and if the second labels of the two texts to be processed are the same, judging that the two texts to be processed are actually repeated texts to be processed, and reserving the texts to be processed with evaluation values larger than a set evaluation threshold value, wherein the evaluation values comprise at least one of timeliness evaluation values, authority evaluation values and information quantity evaluation values.

Illustratively, in a specific application scenario, if the text to be processed is a court announcement, the time feature may be the court session time and the person feature is the party. That is, when judging repetition, repeated texts to be processed are first expected to have court session times that are as close to identical as possible. Because a court may hold several sessions on the same natural day, the feature description period is one natural day: the differences between court session times on the same day are computed to represent the similarity of the court session times. The similarity threshold is a time difference threshold, typically 1 hour; if the difference between two court session times is less than or equal to 1 hour, the two corresponding court announcements can be judged to be repeated. Otherwise, texts to be processed that are preliminarily judged to be non-repeated by court session time may still be repeated in substance. For this reason, repetition is further judged based on the parties: if the name similarity of the parties is greater than a set name similarity threshold, the two corresponding texts to be processed are judged to be repeated; otherwise, they are judged not to be repeated.
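The secondary judgment by party name can be sketched with Python's standard `difflib.SequenceMatcher`. The application does not say how name similarity is computed, so the use of `SequenceMatcher.ratio()`, the threshold of 0.8, and the function names are assumptions.

```python
from difflib import SequenceMatcher

NAME_SIMILARITY_THRESHOLD = 0.8  # hypothetical value; the text only says "set"

def name_similarity(name_a, name_b):
    """Similarity in [0, 1]; 1.0 means the two names are identical."""
    return SequenceMatcher(None, name_a, name_b).ratio()

def is_actual_repeat(party_a, party_b, threshold=NAME_SIMILARITY_THRESHOLD):
    """Two texts are judged repeated when party names are similar enough."""
    return name_similarity(party_a, party_b) > threshold

same_party = is_actual_repeat("Acme Trading Co., Ltd.", "Acme Trading Co. Ltd")
other_party = is_actual_repeat("张三", "王五")
```

Here a higher score means more alike, so repetition is judged when the score exceeds the threshold, whereas for the time feature a smaller time difference indicates repetition.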

Further, because judicial bulletin case numbers come from multiple sources, and different sources describe them in different ways, in the above embodiment, before the determining of a plurality of texts to be processed associated with the same judicial bulletin case number, wherein the texts to be processed include feature description data of judicial events, the method includes:

normalizing the obtained judicial bulletin case number so that its expression conforms to a normalized expression rule;

and traversing the judicial event bulletin library using the normalized judicial bulletin case number, so as to retrieve a plurality of texts to be processed that are associated with the same judicial bulletin case number.

With the normalized judicial bulletin case number, the corresponding texts to be processed are retrieved as completely as possible. Of course, if the judicial bulletin case numbers do not differ in their manner of expression, the normalization may be omitted.
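A normalization rule of this kind might be sketched as below. The application does not define the normalized expression rule; this example assumes the differences between sources are full-width characters, bracket styles, and stray whitespace, and the function name `normalize_case_number` is hypothetical.

```python
import re
import unicodedata

# Map other bracket styles to ASCII parentheses (assumed rule).
_BRACKETS = str.maketrans(
    {"〔": "(", "〕": ")", "[": "(", "]": ")", "【": "(", "】": ")"}
)

def normalize_case_number(raw):
    """Bring a case number into one canonical written form."""
    # NFKC folds full-width digits and parentheses to their ASCII forms.
    text = unicodedata.normalize("NFKC", raw)
    text = text.translate(_BRACKETS)
    # Remove all whitespace inside the case number.
    return re.sub(r"\s+", "", text)

normalized = normalize_case_number("（２０２１）京01民初123号")
```

After normalization, variants such as `〔2021〕 京01民初123号` and `（2021）京01民初123号` map to the same key, so a lookup by the normalized case number retrieves texts from all sources.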

FIG. 4 is a schematic structural diagram of a text deduplication apparatus according to an embodiment of the present application; as shown in fig. 4, it includes:

a text obtaining unit 401, configured to determine multiple texts to be processed that are associated with the same judicial bulletin case number, where the texts to be processed include feature description data of judicial events;

a key feature extraction unit 402, configured to extract a temporal feature describing a judicial event from the feature description data;

a repeated text determining unit 403, configured to determine a repeated text to be processed based on the extracted temporal features and perform deduplication processing on the repeated text to be processed.

Optionally, in an embodiment, the key feature extracting unit 402 is configured to: perform regular-expression matching on the feature description data, based on a regular expression for extracting the time feature, so as to extract the time feature describing the judicial event from the feature description data, wherein the regular expression serves as the extraction model.

Optionally, in an embodiment, the repeated text determining unit 403 is specifically configured to:

Optionally, in an embodiment, the repeated text determining unit 403 is specifically configured to: and if the first labels of the two texts to be processed are the same, judging that the two texts to be processed are repeated texts to be processed, and reserving the texts to be processed with a first evaluation value larger than a set first evaluation threshold value, wherein the first evaluation value comprises at least one of a timeliness evaluation value, an authority evaluation value and an information content evaluation value.

Optionally, in an embodiment, the repeated text determining unit 403 is specifically configured to: based on the set feature description period, the similarity between the time features extracted from different feature description data in the same feature description period is statistically determined.

Optionally, in an embodiment, the key feature extracting unit 402 is further configured to:

for the texts to be processed that are preliminarily judged to be non-repeated according to the extracted time features, extract person features describing judicial events from the corresponding feature description data;

the repeated text determination unit 403 is further configured to: and determining the text to be processed which is actually repeated in the non-repeated text to be processed based on the extracted personnel characteristics, and performing deduplication processing on the text to be processed.

Optionally, in an embodiment, the repeated text determining unit 403 is further specifically configured to:

Optionally, in an embodiment, the key feature extracting unit 402 is further specifically configured to: perform regular-expression matching on the feature description data corresponding to the non-repeated text to be processed, based on a regular expression for extracting the person feature, so as to extract the person feature describing the judicial event.

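Optionally, in an embodiment, the apparatus further includes a normalization processing unit, configured to:

normalize the obtained judicial bulletin case number so that its expression conforms to a normalized expression rule;

and traverse the judicial event bulletin library using the normalized judicial bulletin case number, so as to retrieve a plurality of texts to be processed that are associated with the same judicial bulletin case number.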

Embodiments of the present application further provide a computer storage medium having a computer executable program stored thereon, where the computer executable program is executed to implement the method according to any one of the embodiments of the present application.

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application; as shown in fig. 5, the electronic device includes a memory 501 and a processor 502, the memory 501 is used for storing a computer-executable program, and the processor 502 is used for running the computer-executable program to implement the method of any one of the above embodiments.

The above embodiments are merely specific embodiments of the present application. They are intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for de-duplicating text, comprising:

2. The method of claim 1, wherein said extracting time features describing judicial events from said feature description data comprises: performing regular-expression matching on the feature description data, based on a regular expression for extracting the time features, so as to extract the time features describing judicial events from the feature description data.
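As an illustration of the regular-expression matching recited in claim 2, the following minimal sketch extracts date and time values from feature description data. The pattern `TIME_PATTERN`, the date formats it accepts, and the function name `extract_time_features` are assumptions made for this example; the application does not specify the expression actually used.

```python
import re
from datetime import datetime
from typing import List

# Hypothetical pattern (an assumption): matches dates such as "2021-11-12 09:30",
# "2021/11/12" or "2021年11月12日", with an optional hour and minute.
TIME_PATTERN = re.compile(
    r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?\s*(?:(\d{1,2})[:时](\d{1,2})分?)?"
)


def extract_time_features(description: str) -> List[datetime]:
    """Return every date/time found in the feature description data."""
    features = []
    for match in TIME_PATTERN.finditer(description):
        year, month, day = (int(match.group(i)) for i in (1, 2, 3))
        hour = int(match.group(4)) if match.group(4) else 0
        minute = int(match.group(5)) if match.group(5) else 0
        try:
            features.append(datetime(year, month, day, hour, minute))
        except ValueError:
            # Skip strings that look like dates but are not valid (e.g. month 13).
            continue
    return features
```

Converting each match to a `datetime` gives differently formatted announcements of the same event a common representation, which later comparison steps can rely on.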

3. The method of claim 1, wherein determining and de-duplicating repeated text to be processed based on the extracted time features comprises:

4. The method of claim 3, wherein the performing de-duplication processing on the repeated text to be processed based on the added first label comprises: if the first labels of two texts to be processed are the same, judging that the two texts to be processed are repeated texts to be processed, and retaining the text to be processed whose first evaluation value is greater than a set first evaluation threshold, wherein the first evaluation value comprises at least one of a timeliness evaluation value, an authority evaluation value and an information content evaluation value.
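The label-based retention rule of claim 4 might be sketched as below. The `PendingText` record, its field names, and the function `deduplicate_by_label` are hypothetical, and the evaluation value is assumed to have already been combined into a single number. The claim does not say what happens when no text in a repeated group exceeds the threshold; keeping the highest-scoring text in that case is an assumption of this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PendingText:
    text_id: str
    first_label: str   # hypothetical: label added from the extracted time feature
    evaluation: float  # hypothetical: combined first evaluation value


def deduplicate_by_label(texts: List[PendingText], threshold: float) -> List[PendingText]:
    """Treat texts sharing a first label as repeated and keep those above the threshold."""
    groups: Dict[str, List[PendingText]] = defaultdict(list)
    for text in texts:
        groups[text.first_label].append(text)

    kept: List[PendingText] = []
    for group in groups.values():
        if len(group) == 1:
            kept.extend(group)  # a unique label means the text is not repeated
            continue
        above = [t for t in group if t.evaluation > threshold]
        # Assumption: if nothing exceeds the threshold, keep the best-scoring text
        # so that the judicial event is not lost entirely.
        kept.extend(above if above else [max(group, key=lambda t: t.evaluation)])
    return kept
```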

5. The method of claim 1, wherein determining and de-duplicating repeated text to be processed based on the extracted time features comprises:

6. The method of claim 5, wherein determining similarity between time features extracted from different feature description data comprises: statistically determining, based on a set feature description period, the similarity between the time features extracted from different feature description data within the same feature description period.

7. The method according to claim 6, wherein the text to be processed is a court announcement text, the time feature is a court time recorded in the court announcement text, and the similarity is the difference between the court times recorded in different court announcement texts.
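One possible reading of claims 6 and 7 is sketched below: court times are first bucketed by feature description period, and within each bucket the difference between two court times serves as the similarity. The function name `find_repeated_pairs`, the fixed-length `period`, and the `tolerance` below which two announcements are flagged as repeated are all assumptions of this example and are not specified in the claims.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations
from typing import Dict, List, Tuple

EPOCH = datetime(1970, 1, 1)  # arbitrary reference point for bucketing


def find_repeated_pairs(
    court_times: Dict[str, datetime],
    period: timedelta,
    tolerance: timedelta,
) -> List[Tuple[str, str]]:
    """Flag pairs of announcements in the same period whose court times nearly match."""
    # Bucket announcement ids by the feature description period they fall in.
    buckets: Dict[int, List[str]] = defaultdict(list)
    for text_id, court_time in court_times.items():
        buckets[(court_time - EPOCH) // period].append(text_id)

    pairs: List[Tuple[str, str]] = []
    for ids in buckets.values():
        for a, b in combinations(ids, 2):
            # The similarity is taken to be the difference between the two court times.
            if abs(court_times[a] - court_times[b]) <= tolerance:
                pairs.append((a, b))
    return pairs
```

Restricting comparisons to a single period keeps the number of pairwise checks small, at the cost of missing near-identical times that fall on either side of a period boundary.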

8. The method according to any one of claims 1-7, further comprising:

9. The method according to claim 8, wherein the determining and de-duplicating an actually repeated text to be processed among the non-repeated texts to be processed based on the extracted personnel features comprises:

10. The method of claim 8, wherein the de-duplication of the actually repeated text to be processed based on the added second label comprises:

11. The method of claim 8, wherein said extracting personnel features describing judicial events from corresponding feature description data comprises: performing regular-expression matching on the feature description data corresponding to the non-repeated text to be processed, based on a regular expression for extracting the personnel features, so as to extract the personnel features describing judicial events.

12. The method of any one of claims 8-11, wherein the personnel feature is a party name.

13. The method according to any one of claims 1-12, wherein, before said determining a plurality of texts to be processed associated with the same judicial bulletin case number, said texts to be processed comprising feature description data of judicial events, the method further comprises:

14. A device for removing duplicate text, comprising:

15. A computer storage medium having a computer-executable program stored thereon, the computer-executable program being executed to implement the method of any one of claims 1-13.

16. An electronic device, comprising a memory for storing a computer-executable program and a processor for executing the computer-executable program to perform the method of any one of claims 1-13.