CN114090798A - Text duplicate removal method and device, computer storage medium and electronic equipment - Google Patents

Text duplicate removal method and device, computer storage medium and electronic equipment

Info

Publication number
CN114090798A
Authority
CN
China
Prior art keywords
processed
texts
repeated
text
judicial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111342221.3A
Other languages
Chinese (zh)
Inventor
潘仕江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yancheng Tianyanchawei Technology Co ltd
Original Assignee
Yancheng Jindi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yancheng Jindi Technology Co Ltd filed Critical Yancheng Jindi Technology Co Ltd
Priority to CN202111342221.3A priority Critical patent/CN114090798A/en
Publication of CN114090798A publication Critical patent/CN114090798A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Technology Law (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a text deduplication method and apparatus, a computer storage medium, and an electronic device. The text deduplication method comprises: determining a plurality of texts to be processed that are associated with the same judicial bulletin case number, wherein the texts to be processed comprise feature description data of judicial events; extracting time features describing the judicial events from the feature description data; and determining repeated texts to be processed based on the extracted time features and deduplicating them, thereby deduplicating the texts to be processed from which judicial data can be analyzed.

Description

Text duplicate removal method and device, computer storage medium and electronic equipment
Technical Field
The application relates to the technical field of data processing, in particular to a text duplicate removal method and device, a computer storage medium and electronic equipment.
Background
Based on big data solutions, collected enterprise data undergoes a series of deep-mining steps such as cleaning, analysis, and sorting, so that comprehensive or classified data query services can be provided, for example queries of enterprise-related information that includes judicial data. Based on judicial data, an enterprise can avoid risk when choosing subsequent partners, analyze the credit of a partner enterprise to decide whether to collaborate further, and so on. However, because Internet data comes from numerous sources, the data from which judicial data can be analyzed contains many duplicates.
Disclosure of Invention
Embodiments of the present application provide a text deduplication method and apparatus, a computer storage medium, and an electronic device, so as to overcome or alleviate the above technical problems in the prior art.
The technical scheme adopted by the application is as follows:
a method of de-duplicating text, comprising:
determining a plurality of texts to be processed which are associated with the same judicial bulletin case number, wherein the texts to be processed comprise characteristic description data of judicial events;
extracting time features describing judicial events from the feature description data;
and determining repeated texts to be processed and carrying out deduplication processing on the repeated texts to be processed based on the extracted time characteristics.
Optionally, in an embodiment, the extracting, from the feature description data, a temporal feature describing a judicial event includes: and performing regular matching on the feature description data based on a regular expression for extracting the time features so as to extract the time features for describing judicial events from the feature description data.
Optionally, in an embodiment, the determining and de-duplicating a repeated text to be processed based on the extracted temporal features includes:
determining repeated texts to be processed based on the extracted time characteristics, and adding a first label to the repeated texts to be processed;
and based on the added first label, carrying out duplicate removal processing on the repeated text to be processed.
Optionally, in an embodiment, the determining and de-duplicating a repeated text to be processed based on the extracted temporal features includes:
determining similarity between temporal features extracted from different feature description data;
and in response to the judgment result that the similarity is smaller than the set similarity threshold, judging the corresponding at least two texts to be processed into repeated texts to be processed, and performing deduplication processing on the repeated texts to be processed.
Optionally, in an embodiment, the determining the similarity between the time features extracted from the different feature description data includes: based on the set feature description period, the similarity between the time features extracted from different feature description data in the same feature description period is statistically determined.
Optionally, in an embodiment, the method further includes:
for texts to be processed that are preliminarily judged to be non-repeated according to the extracted time characteristics, extracting personnel characteristics for describing judicial events from the characteristic description data of the non-repeated texts to be processed;
and determining, based on the extracted personnel characteristics, the texts to be processed that are actually repeated among the non-repeated texts to be processed, and performing deduplication processing on them.
Optionally, in an embodiment, determining, based on the extracted person features, an actually repeated text to be processed in the non-repeated text to be processed, and performing deduplication processing on the text to be processed, includes:
determining an actual repeated text to be processed in the non-repeated text to be processed based on the extracted personnel features, and adding a second label to the actual repeated text to be processed;
and based on the added second label, carrying out duplicate removal processing on the actually repeated text to be processed.
Optionally, in an embodiment, the performing, based on the added second label, deduplication processing on the actually repeated text to be processed includes:
and if the second labels of the two texts to be processed are the same, judging that the two texts to be processed are actually repeated texts to be processed, and reserving the texts to be processed with evaluation values larger than a set evaluation threshold value, wherein the evaluation values comprise at least one of timeliness evaluation values, authority evaluation values and information quantity evaluation values.
Optionally, in an embodiment, the extracting, from the corresponding feature description data, a person feature describing a judicial event includes: and performing regular matching on the feature description data of the non-repeated text to be processed based on the regular expression for extracting the personnel features so as to extract the personnel features for describing judicial events.
Optionally, in an embodiment, the determining multiple texts to be processed associated with the same judicial bulletin case number, where the texts to be processed include feature description data of a judicial event, includes:
normalizing the obtained judicial bulletin case number so that its expression conforms to a normalized expression rule;
and traversing the judicial event bulletin library by using the judicial bulletin case number after the normalization processing so as to retrieve a plurality of texts to be processed which are related to the same judicial bulletin case number.
Optionally, in an embodiment, the time characteristic is a legal time, and the personnel characteristic is a legal party; or, the time characteristic is a legal party, and the personnel characteristic is a legal time.
Optionally, in an embodiment, the text to be processed is a text corresponding to a court announcement or a judgment document, or a text including feature description data in the court announcement or the judgment document.
A device for de-duplicating text, comprising:
a text acquisition unit, configured to determine a plurality of texts to be processed which are related to the same judicial bulletin case number, wherein the texts to be processed comprise characteristic description data of judicial events;
a key feature extraction unit, configured to extract the time features describing the judicial events from the feature description data;
and a repeated text determining unit, configured to determine repeated texts to be processed based on the extracted time characteristics and carry out deduplication processing on the repeated texts to be processed.
A computer storage medium having stored thereon a computer executable program, the computer executable program being operative to perform a method as in any one of the embodiments of the present application.
An electronic device comprising a memory for storing thereon a computer-executable program and a processor for executing the computer-executable program to implement the method of any of the embodiments of the present application.
In the embodiments of the present application, a plurality of texts to be processed that are associated with the same judicial bulletin case number are determined, wherein the texts to be processed comprise feature description data of judicial events; time features describing the judicial events are extracted from the feature description data; and repeated texts to be processed are determined based on the extracted time features and deduplicated, thereby deduplicating the texts to be processed from which judicial data can be analyzed.
Drawings
FIG. 1 is a schematic view of a scenario in which a user uses an application according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a text deduplication method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating a text deduplication method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a text deduplication apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the technical problems, technical solutions and advantages to be solved by the present application clearer, the following detailed description is made with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a schematic view of a scenario in which a user uses an application according to an embodiment of the present application; as shown in fig. 1, the application scenario is directed to a data query system, where the data query system includes a terminal 101 and an application server 102, where the application server 102 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as cloud service, cloud database, cloud computing, cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and big data and artificial intelligence platform. The terminal 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 101 and the application server 102 may be directly or indirectly connected through a wireless communication manner (such as a network), and the application is not limited herein.
In order to ensure the response speed and efficiency of the query, the application server 102 is provided with a reference database, an ES database and a detail database, wherein the reference database stores reference ES data and detail data, the ES database stores ES data, the detail database stores detail data corresponding to the ES data, and the ES data stored by the ES database and the detail data stored by the detail database are synchronized from the reference database. When the user uses the application program to be tested installed on the terminal to carry out data query, the retrieved result data is directly from the ES database and the detail database.
Here, the server of the data correctness verifying apparatus is not particularly limited; it may be, for example, the same physical server but logically separate, or a different physical server.
The following embodiments produce ES data associated with judicial data and corresponding detail data based on the deduplicated text to be processed and further based on a data production system.
In the embodiments described below, the execution subject of the method may be a data production system, such as a data processing server in the data production system.
In the following embodiments, the text to be processed is a text corresponding to a court announcement or a judgment document, or a text including feature description data in the court announcement or the judgment document. Court announcements and judgment documents are issued by judicial authorities. Texts containing the feature description data in the court announcements or judgment documents are, for example, third-party texts formed by further data mining of the court announcements and judgment documents, and may also include related news reports and the like. Alternatively, the text to be processed may be plan information, a delivery notice, or the like.
FIG. 2 is a schematic flowchart of a text deduplication method according to an embodiment of the present application; as shown in fig. 2, it includes:
S201, determining a plurality of texts to be processed related to the same judicial bulletin case number, wherein the texts to be processed comprise characteristic description data of judicial events;
In this embodiment, if all the texts to be processed have been collected into a to-be-processed text library, the judicial bulletin case number may be used to search that library, so as to determine a plurality of texts to be processed with the same judicial bulletin case number.
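The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the application does not prescribe a data structure, so the dictionary layout and the field names `case_no` and `source` are hypothetical.

```python
from collections import defaultdict


def group_by_case_number(text_library):
    """Group texts to be processed by judicial bulletin case number.

    `text_library` is assumed to be an iterable of dicts with a
    hypothetical "case_no" field; this layout is not from the application.
    """
    groups = defaultdict(list)
    for text in text_library:
        groups[text["case_no"]].append(text)
    # Only case numbers shared by several texts can contain duplicates.
    return {case_no: items for case_no, items in groups.items() if len(items) > 1}


library = [
    {"case_no": "(2021)京0105民初123号", "source": "court site"},
    {"case_no": "(2021)京0105民初123号", "source": "news report"},
    {"case_no": "(2021)沪0101民初456号", "source": "court site"},
]
candidates = group_by_case_number(library)
```

Grouping once up front keeps every later comparison inside a single case number, which is the scope the method works on.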
In this embodiment, the feature description data of the judicial event includes any data that can characterize the judicial event and thereby distinguish it from non-judicial events.
S202, extracting time characteristics describing judicial events from the characteristic description data;
Optionally, the extracting the temporal feature describing the judicial event from the feature description data includes: extracting temporal features describing judicial events from the feature description data based on an extraction model used to extract the temporal features.
In this embodiment, the time characteristic may be partial data in the data representing the judicial event.
Further, the extracting the temporal feature describing the judicial event from the feature description data based on the extraction model for extracting the temporal feature comprises: and performing regular matching on the feature description data based on a regular expression for extracting the time features to extract the time features describing judicial events from the feature description data, wherein the regular expression is used as the extraction model.
Specifically, regular expressions may be built based on the format or representation of the temporal features. Different temporal features configure different regular expressions.
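As a rough illustration of regular matching for a time feature, the sketch below assumes the court time appears either as `2021-11-12 09:30` or as `2021年11月12日 09:30`. The pattern, the function name `extract_session_time`, and the sample sentence are assumptions made for this example; the application does not specify a concrete expression.

```python
import re
from datetime import datetime

# Hypothetical pattern covering two assumed date formats:
#   "2021-11-12 09:30" and "2021年11月12日 09:30"
SESSION_TIME_RE = re.compile(
    r"(\d{4})[-年](\d{1,2})[-月](\d{1,2})日?\s*(\d{1,2})[:：时](\d{2})"
)


def extract_session_time(description):
    """Return the first time feature found in the description, or None."""
    match = SESSION_TIME_RE.search(description)
    if match is None:
        return None
    year, month, day, hour, minute = map(int, match.groups())
    return datetime(year, month, day, hour, minute)


session_time = extract_session_time("本院定于2021年11月12日 09:30开庭审理")
```

A different time feature, or a source that writes dates differently, would need its own expression, in line with configuring different regular expressions for different temporal features.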
In other embodiments, the extraction model may also be a neural network model, an expert system model, or a text recognition model, and may be flexibly selected according to the requirements of the application scenario.
S203, determining repeated texts to be processed based on the extracted time characteristics and carrying out deduplication processing on the repeated texts to be processed.
Optionally, in this embodiment, the determining repeated texts to be processed and performing deduplication processing on the repeated texts to be processed based on the extracted temporal features includes:
determining repeated texts to be processed based on the extracted time characteristics, and adding a first label to the repeated texts to be processed;
and based on the added first label, carrying out duplicate removal processing on the repeated text to be processed.
Adding the first label allows the deduplication processing to be carried out conveniently and quickly. Preferably, the first labels added to repeated texts to be processed are the same. Of course, in other embodiments, the first labels added to the repeated texts to be processed may also be different, in which case a mapping relationship between the first labels is established to indicate that the texts to be processed are repeated texts.
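For the variant in which a mapping relationship between first labels marks texts as repeated, one possible representation is a dictionary from a label to the label it duplicates. The sketch below is an assumption for illustration only: the label values, the `label_map` structure, and the helper `resolve_label` are not from the application, and the mapping is assumed to contain no cycles.

```python
def resolve_label(label, label_map):
    """Follow the mapping until a label with no further mapping is reached.

    `label_map` is a hypothetical dict {label: label_it_duplicates};
    it is assumed to be acyclic.
    """
    while label in label_map:
        label = label_map[label]
    return label


# Labels "B" and "C" were added to texts that repeat the text labelled "A".
label_map = {"B": "A", "C": "B"}
same_group = resolve_label("C", label_map) == resolve_label("A", label_map)
```

Two texts then count as repeated when their labels resolve to the same representative label.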
Specifically, the determining and de-duplicating repeated texts to be processed based on the extracted temporal features may include:
determining similarity between temporal features extracted from different feature description data;
and in response to the judgment result that the similarity is smaller than the set similarity threshold, judging the corresponding at least two texts to be processed into repeated texts to be processed, and performing deduplication processing on the repeated texts to be processed.
By the method based on the similarity between the time characteristics, the judgment of the repeated texts to be processed is quickly and simply realized, the data processing efficiency is improved, and the accuracy is ensured.
In combination with the manner of adding the first label, the determining and de-duplicating repeated texts to be processed based on the extracted temporal features may include:
determining similarity between temporal features extracted from different feature description data;
in response to a judgment result that the similarity is smaller than a set similarity threshold, judging at least two corresponding texts to be processed into repeated texts to be processed, and adding a first label to the repeated texts to be processed;
and based on the added first label, carrying out duplicate removal processing on the repeated text to be processed.
By the method for adding the first label and based on the similarity, the efficiency of data processing is further improved, and meanwhile, the accuracy is guaranteed.
Optionally, in an embodiment, the performing, based on the added first label, a deduplication process on the repeated text to be processed includes: and if the first labels of the two texts to be processed are the same, judging that the two texts to be processed are repeated texts to be processed, and reserving the texts to be processed with a first evaluation value larger than a set first evaluation threshold value, wherein the first evaluation value comprises at least one of a timeliness evaluation value, an authority evaluation value and an information content evaluation value.
Optionally, in an embodiment, the determining the similarity between the time features extracted from the different feature description data includes: based on the set feature description period, the similarity between the time features extracted from different feature description data in the same feature description period is determined statistically, so that feature data for which the similarity is calculated is reduced, and the data processing efficiency is improved.
Illustratively, in a specific application scenario, the text to be processed is a court announcement text, the time feature is the court time recorded in the court announcement text, and the similarity is the difference between the court times recorded in different court announcement texts. That is, in the repetition determination, repeated texts to be processed are expected to record court times that are as close to identical as possible. Because a court may open several times on the same natural day, the above feature description period is set to a natural day: the differences between court times on the same day are counted to represent the similarity of the court times. The above similarity threshold is then a time difference threshold, for example 1 hour; if the difference between two court times is less than or equal to 1 hour, the corresponding two court announcements can be determined to be repeated.
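The scenario above (natural day as the feature description period, 1 hour as the time difference threshold) can be sketched as follows. The field names `session_time` and `first_label` are hypothetical, and comparing each text only with the previous one in time order is a simplification chosen for this example, not something the application specifies.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIME_DIFF_THRESHOLD = timedelta(hours=1)  # example threshold from the text


def assign_first_labels(texts):
    """Label texts whose court times fall on the same day and lie within the threshold."""
    by_day = defaultdict(list)
    for text in texts:
        # Feature description period: one natural (calendar) day.
        by_day[text["session_time"].date()].append(text)

    next_label = 0
    for day_texts in by_day.values():
        day_texts.sort(key=lambda t: t["session_time"])
        previous = None
        for text in day_texts:
            gap_too_large = (
                previous is not None
                and text["session_time"] - previous["session_time"] > TIME_DIFF_THRESHOLD
            )
            if previous is None or gap_too_large:
                next_label += 1  # start a new group of repeated texts
            text["first_label"] = next_label
            previous = text
    return texts


a = {"session_time": datetime(2021, 11, 12, 9, 0)}
b = {"session_time": datetime(2021, 11, 12, 9, 30)}
c = {"session_time": datetime(2021, 11, 12, 14, 0)}
d = {"session_time": datetime(2021, 11, 13, 9, 0)}
assign_first_labels([a, b, c, d])
```

Bucketing by day first means times are only compared within one day, which is how the feature description period reduces the amount of data for which similarity is calculated.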
In summary, repeated texts to be processed are identified through the time features and then deduplicated. When deduplication is performed, for example, only one text to be processed is reserved, such as the one with the highest first evaluation value, for instance a text to be processed that comes from an authoritative source.
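Reserving only the text with the highest evaluation value in each group of repeated texts could look like the sketch below. The field names `first_label` and `evaluation` and the numeric scores are assumptions for illustration; how timeliness, authority, and information content are combined into one value is not specified in the application.

```python
def keep_best_per_label(texts, label_key="first_label", score_key="evaluation"):
    """Reserve, for each label, the single text with the highest evaluation value."""
    best = {}
    for text in texts:
        label = text[label_key]
        if label not in best or text[score_key] > best[label][score_key]:
            best[label] = text
    return list(best.values())


labelled = [
    {"first_label": 1, "source": "news report", "evaluation": 0.4},
    {"first_label": 1, "source": "court site", "evaluation": 0.9},
    {"first_label": 2, "source": "court site", "evaluation": 0.7},
]
reserved = keep_best_per_label(labelled)
```

The threshold variant described in the text would instead keep every text whose evaluation value exceeds a set threshold.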
FIG. 3 is a schematic flow chart illustrating a text deduplication method according to an embodiment of the present application. Unlike the foregoing embodiment of FIG. 2, this embodiment adds a processing step that makes a determination based on personnel features, in order to handle texts to be processed that may in substance be repeated even though the embodiment of FIG. 2 judged them non-repeated. Specifically, as shown in FIG. 3, the method includes:
S301, determining a plurality of texts to be processed related to the same judicial bulletin case number, wherein the texts to be processed comprise characteristic description data of judicial events;
S302, extracting time characteristics describing judicial events from the characteristic description data;
S303, determining repeated texts to be processed based on the extracted time characteristics and carrying out deduplication processing on the repeated texts to be processed.
In this embodiment, steps S301 to S303 are similar to steps S201 to S203 described above, and are not described again in detail. Of course, in light of the present application, those skilled in the art can implement the steps S201 to S203 differently from the above without departing from the spirit of the present application.
S304, for the texts to be processed that are preliminarily judged to be non-repeated according to the extracted time characteristics, extracting personnel characteristics for describing judicial events from the characteristic description data of the non-repeated texts to be processed;
In this embodiment, as described above, the non-repeated texts to be processed that remain after deduplication based on the time feature may still contain texts that are actually repeated, so the personnel feature is extracted in step S304 to make a further determination of actual repetition.
In this embodiment, the extracting of the person feature describing the judicial event from the feature description data of the non-repetitive text to be processed includes: and extracting the personnel features describing the judicial events from the feature description data of the non-repeated text to be processed based on an extraction model for extracting the personnel features.
Further, the extracting of the person feature describing the judicial event from the feature description data of the non-repeated text to be processed includes: and performing regular matching on the feature description data of the non-repeated text to be processed based on the regular expression for extracting the personnel features so as to extract the personnel features for describing judicial events.
Here, specifically, the regular expression may be established based on the format or expression manner of the person feature. Different personnel features configure different regular expressions.
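A possible regular expression for a personnel feature is sketched below. It assumes the parties appear after the role markers 原告 (plaintiff) and 被告 (defendant) followed by a colon; this format, the pattern, and the function name `extract_parties` are assumptions for illustration and are not given in the application.

```python
import re

# Hypothetical pattern: a role marker, a colon, then the party name up to
# the next separator (semicolon, comma, full stop, or whitespace).
PARTY_RE = re.compile(r"(原告|被告)[:：]\s*([^;；,，。\s]+)")


def extract_parties(description):
    """Return the set of party names found in the description."""
    return {name for _role, name in PARTY_RE.findall(description)}


parties = extract_parties("原告：张三；被告：李四。")
```

Returning a set makes the later comparison between two texts independent of the order in which the parties are listed.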
In other embodiments, the extraction model may also be a neural network model, an expert system model, or a text recognition model, and may be flexibly selected according to the requirements of the application scenario.
S305, determining the text to be processed which is actually repeated in the non-repeated text to be processed based on the extracted personnel features, and performing deduplication processing on the text to be processed.
In this embodiment, deduplication based on the time features serves as a preliminary deduplication and deduplication based on the personnel features serves as a secondary deduplication, which improves deduplication efficiency and removes repeated texts to be processed that the preliminary deduplication may have missed.
In this embodiment, determining an actually repeated text to be processed in the non-repeated text to be processed based on the extracted person features, and performing deduplication processing on the text to be processed, includes:
determining an actual repeated text to be processed in the non-repeated text to be processed based on the extracted personnel features, and adding a second label to the actual repeated text to be processed;
and based on the added second label, carrying out duplicate removal processing on the actually repeated text to be processed.
By adding the second label in a manner similar to the manner of adding the first label, the deduplication processing can be realized quickly.
Specifically, the determining and de-duplicating repeated texts to be processed based on the extracted personnel features may include:
determining similarity between the person features extracted from the different feature description data;
and in response to a judgment result that the similarity is smaller than a set similarity threshold, judging the corresponding at least two non-repeated texts to be processed into repeated texts to be processed, and performing de-duplication processing on the repeated texts.
Through the above manner based on the similarity between the personnel characteristics, the judgment of the repeated texts to be processed is quickly and simply realized, the data processing efficiency is improved, and the accuracy is ensured.
In combination with the above manner of adding the second label, the determining and de-duplicating the repeated text to be processed based on the extracted human features may include:
determining similarity between the person features extracted from the different feature description data;
in response to a judgment result that the similarity is smaller than a set similarity threshold, judging at least two corresponding non-repeated texts to be processed into repeated texts to be processed, and adding a second label to the repeated texts to be processed;
and based on the added second label, carrying out duplicate removal processing on the repeated text to be processed.
Through the mode of adding the second label and based on the similarity, the efficiency of data processing is further improved, and meanwhile, the accuracy is guaranteed.
Optionally, in this embodiment, the performing, based on the added second label, deduplication processing on an actually repeated text to be processed includes:
and if the second labels of the two texts to be processed are the same, judging that the two texts to be processed are actually repeated texts to be processed, and reserving the texts to be processed with evaluation values larger than a set evaluation threshold value, wherein the evaluation values comprise at least one of timeliness evaluation values, authority evaluation values and information quantity evaluation values.
Illustratively, in a specific application scenario, if the text to be processed is a trial announcement, the time feature may be the trial time and the personnel feature may be the party. That is, in the repetition determination, repeated texts to be processed are first expected to record trial times that are as close to identical as possible. Because a trial may be opened multiple times on the same natural day, the feature description period is a natural day: the differences between trial times on the same day are counted to represent the similarity of the trial times. The similarity threshold is a time difference threshold, typically 1 hour; if the difference between two trial times is less than or equal to 1 hour, the corresponding two trial announcements may be determined to be repeated. Otherwise, texts to be processed that are preliminarily judged non-repeated by trial time may still be repeated in substance, so the repetition determination is further carried out based on the party: if the name similarity of the parties is greater than the set name similarity threshold, the corresponding two texts to be processed are judged to be repeated; otherwise, they are judged not to be repeated.
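The two-stage determination in this scenario can be sketched as a pairwise check. The name similarity measure is not specified in the application; the sketch swaps in `difflib.SequenceMatcher` from the Python standard library, and the threshold value 0.8 and the field names `session_time` and `party` are likewise assumptions.

```python
from datetime import datetime, timedelta
from difflib import SequenceMatcher

TIME_DIFF_THRESHOLD = timedelta(hours=1)  # example threshold from the text
NAME_SIMILARITY_THRESHOLD = 0.8           # hypothetical value


def is_repeated(text_a, text_b):
    """Stage 1: trial times on the same day within the threshold.
    Stage 2: otherwise, party name similarity above the threshold."""
    time_a, time_b = text_a["session_time"], text_b["session_time"]
    if time_a.date() == time_b.date() and abs(time_a - time_b) <= TIME_DIFF_THRESHOLD:
        return True
    similarity = SequenceMatcher(None, text_a["party"], text_b["party"]).ratio()
    return similarity > NAME_SIMILARITY_THRESHOLD


morning = {"session_time": datetime(2021, 11, 12, 9, 0), "party": "Acme Trading Co., Ltd."}
afternoon = {"session_time": datetime(2021, 11, 12, 14, 0), "party": "Acme Trading Co., Ltd"}
other = {"session_time": datetime(2021, 11, 12, 14, 0), "party": "Beta Logistics Inc."}
```

Here `morning` and `afternoon` fail the time check but are still judged repeated because their party names are nearly identical, whereas `morning` and `other` pass neither check.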
Further, judicial bulletin case numbers come from multiple sources, and different sources may write the same case number in different ways. Therefore, in the above embodiment, before determining the plurality of texts to be processed that are associated with the same judicial bulletin case number (the texts to be processed including feature description data of judicial events), the method includes:
normalizing the obtained judicial bulletin case number so that its expression conforms to a normalized expression rule; and
traversing the judicial event bulletin library using the normalized judicial bulletin case number, so as to retrieve the plurality of texts to be processed that are associated with the same judicial bulletin case number.
Using the normalized judicial bulletin case number allows the corresponding texts to be processed to be retrieved as completely as possible. Of course, if the judicial bulletin case numbers do not differ in how they are expressed, the normalization processing may be omitted.
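A minimal sketch of such normalization and retrieval is given below. The normalized expression rule used here (half-width characters, round brackets, no whitespace) is an assumption chosen for illustration, since the application does not define the rule. The function names and the in-memory list standing in for the judicial event bulletin library are likewise hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed rule: map common bracket variants to round brackets.
_BRACKETS = str.maketrans({"〔": "(", "〕": ")", "[": "(", "]": ")", "【": "(", "】": ")"})


def normalize_case_number(raw: str) -> str:
    s = unicodedata.normalize("NFKC", raw)  # full-width digits/brackets -> half-width
    s = s.translate(_BRACKETS)              # remaining bracket variants -> "(" and ")"
    return re.sub(r"\s+", "", s)            # drop all whitespace


def group_by_case_number(records):
    """Group (raw_case_number, text) pairs under their normalized case number."""
    groups = defaultdict(list)
    for raw_number, text in records:
        groups[normalize_case_number(raw_number)].append(text)
    return dict(groups)
```

Grouping on the normalized key is what lets announcements from different sources end up in the same candidate set before any time or personnel features are compared.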
FIG. 4 is a schematic structural diagram of a text deduplication apparatus according to an embodiment of the present application. As shown in FIG. 4, the apparatus includes:
a text obtaining unit 401, configured to determine multiple texts to be processed that are associated with the same judicial bulletin case number, where the texts to be processed include feature description data of judicial events;
a key feature extraction unit 402, configured to extract a temporal feature describing a judicial event from the feature description data;
a repeated text determining unit 403, configured to determine a repeated text to be processed based on the extracted temporal features and perform deduplication processing on the repeated text to be processed.
Optionally, in an embodiment, the key feature extraction unit 402 is configured to: perform regular-expression matching on the feature description data, based on a regular expression for extracting time features, so as to extract the time features describing judicial events from the feature description data, where the regular expression serves as the extraction model.
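As an illustration of regular-expression extraction of a time feature, the sketch below parses a court time from free text. The pattern and the function name `extract_time_feature` are hypothetical. The application does not disclose its actual regular expression, and real announcements may use date formats that this pattern does not cover.

```python
import re
from datetime import datetime

# Hypothetical pattern: matches e.g. "2021年11月12日 09:30" or "2021-11-12 14:05".
COURT_TIME_RE = re.compile(
    r"(?P<y>\d{4})[年/-](?P<m>\d{1,2})[月/-](?P<d>\d{1,2})日?\s*"
    r"(?P<H>\d{1,2})[:：时](?P<M>\d{1,2})"
)


def extract_time_feature(description: str):
    """Return the first court time found in the feature description data, or None."""
    match = COURT_TIME_RE.search(description)
    if match is None:
        return None
    return datetime(
        int(match["y"]), int(match["m"]), int(match["d"]),
        int(match["H"]), int(match["M"]),
    )
```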
Optionally, in an embodiment, the repeated text determining unit 403 is specifically configured to:
determine repeated texts to be processed based on the extracted time features, and add a first label to the repeated texts to be processed; and
perform deduplication processing on the repeated texts to be processed based on the added first label.
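One way to assign such labels is sketched below: texts whose time features lie within the threshold of one another receive the same label. The function name, the label format, and the use of a union-find structure are assumptions. Union-find in particular makes the "repeated" relation transitive, which the application does not state.

```python
from datetime import datetime, timedelta


def assign_first_labels(times, threshold=timedelta(hours=1)):
    """Return one label per text; texts judged repeated share a label, others get None."""
    parent = list(range(len(times)))

    def find(i):
        # Follow parent links to the group root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(times)):
        for j in range(i + 1, len(times)):
            if abs(times[i] - times[j]) <= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    roots = [find(i) for i in range(len(times))]
    sizes = {r: roots.count(r) for r in set(roots)}
    # Only groups with more than one member are actually "repeated".
    return [f"dup-{r}" if sizes[r] > 1 else None for r in roots]
```

Deduplication can then operate on label groups instead of re-comparing every pair of texts.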
Optionally, in an embodiment, the repeated text determining unit 403 is specifically configured to: if the first labels of two texts to be processed are the same, determine that the two texts are repeated texts to be processed, and retain the texts to be processed whose first evaluation value is greater than a set first evaluation threshold, where the first evaluation value includes at least one of a timeliness evaluation value, an authority evaluation value, and an information-content evaluation value.
Optionally, in an embodiment, the repeated text determining unit 403 is specifically configured to:
determine the similarity between time features extracted from different feature description data; and
in response to a judgment result that the similarity is smaller than a set similarity threshold, determine that the corresponding at least two texts to be processed are repeated texts to be processed, and perform deduplication processing on the repeated texts to be processed.
Optionally, in an embodiment, the repeated text determining unit 403 is specifically configured to: determine, by statistics and based on a set feature description period, the similarity between time features extracted from different feature description data within the same feature description period. As in the court announcement example, the similarity may be expressed as a time difference, so a value below the threshold indicates closely matching time features.
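The per-period statistics can be sketched as follows, taking one natural day as the feature description period, as in the court announcement example. The function name and the choice to return index pairs are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations


def same_day_time_differences(times):
    """Map (i, j) index pairs that fall on the same natural day to their time difference."""
    by_day = defaultdict(list)
    for index, t in enumerate(times):
        by_day[t.date()].append(index)   # bucket texts by feature description period
    differences = {}
    for indices in by_day.values():
        for i, j in combinations(indices, 2):
            differences[(i, j)] = abs(times[i] - times[j])
    return differences
```

Pairs that fall in different periods never appear in the result, so they are never compared against the time-difference threshold.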
Optionally, in an embodiment, the key feature extraction unit 402 is further configured to:
for texts to be processed that are preliminarily judged to be non-repeated based on the extracted time features, extract personnel features describing judicial events from the corresponding feature description data;
the repeated text determining unit 403 is further configured to: determine, based on the extracted personnel features, the texts to be processed that are actually repeated among the non-repeated texts to be processed, and perform deduplication processing on them.
Optionally, in an embodiment, the repeated text determining unit 403 is further specifically configured to:
determine, based on the extracted personnel features, the actually repeated texts to be processed among the non-repeated texts to be processed, and add a second label to the actually repeated texts to be processed; and
perform deduplication processing on the actually repeated texts to be processed based on the added second label.
Optionally, in an embodiment, the repeated text determining unit 403 is further specifically configured to:
if the second labels of two texts to be processed are the same, determine that the two texts are actually repeated texts to be processed, and retain the texts to be processed whose evaluation value is greater than a set evaluation threshold, where the evaluation value includes at least one of a timeliness evaluation value, an authority evaluation value, and an information-content evaluation value.
Optionally, in an embodiment, the key feature extraction unit 402 is further specifically configured to: perform regular-expression matching on the feature description data corresponding to the non-repeated texts to be processed, based on a regular expression for extracting personnel features, so as to extract the personnel features describing judicial events.
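A sketch of regular-expression extraction of personnel features is shown below. The pattern assumes that party names follow a role keyword such as 当事人, 原告, or 被告 and a colon. This layout, the pattern itself, and the function name are hypothetical, because the application does not disclose its regular expression.

```python
import re

# Hypothetical pattern: a role keyword, a colon, then names up to the next delimiter.
PARTY_RE = re.compile(r"(?:当事人|原告|被告)[:：]\s*(?P<names>[^;；。\n]+)")


def extract_party_names(description: str):
    """Return all party names found in the feature description data, in order."""
    names = []
    for match in PARTY_RE.finditer(description):
        # Several parties may be listed after one keyword, separated by commas.
        for name in re.split(r"[、,，]", match["names"]):
            if name.strip():
                names.append(name.strip())
    return names
```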
Optionally, in an embodiment, the apparatus further includes a normalization processing unit, configured to:
normalize the obtained judicial bulletin case number so that its expression conforms to a normalized expression rule; and
traverse the judicial event bulletin library using the normalized judicial bulletin case number, so as to retrieve the plurality of texts to be processed that are associated with the same judicial bulletin case number.
Embodiments of the present application further provide a computer storage medium having a computer executable program stored thereon, where the computer executable program is executed to implement the method according to any one of the embodiments of the present application.
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in FIG. 5, the electronic device includes a memory 501 and a processor 502; the memory 501 is used for storing a computer-executable program, and the processor 502 is used for running the computer-executable program to implement the method of any one of the above embodiments.
The above embodiments are merely specific embodiments of the present application. They illustrate the technical solutions of the present application and do not limit them, and the protection scope of the present application is not limited to them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed in the present application, modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present application and shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A method for de-duplicating text, comprising:
determining a plurality of texts to be processed which are associated with the same judicial bulletin case number, wherein the texts to be processed comprise feature description data of judicial events;
extracting time features describing judicial events from the feature description data; and
determining repeated texts to be processed based on the extracted time features, and performing deduplication processing on the repeated texts to be processed.
2. The method of claim 1, wherein said extracting time features describing judicial events from said feature description data comprises: performing regular-expression matching on the feature description data, based on a regular expression for extracting the time features, so as to extract the time features describing judicial events from the feature description data.
3. The method of claim 1, wherein determining repeated texts to be processed based on the extracted time features and performing deduplication processing on the repeated texts to be processed comprises:
determining repeated texts to be processed based on the extracted time features, and adding a first label to the repeated texts to be processed; and
performing deduplication processing on the repeated texts to be processed based on the added first label.
4. The method of claim 3, wherein the performing deduplication processing on the repeated texts to be processed based on the added first label comprises: if the first labels of two texts to be processed are the same, determining that the two texts are repeated texts to be processed, and retaining the texts to be processed whose first evaluation value is greater than a set first evaluation threshold, wherein the first evaluation value comprises at least one of a timeliness evaluation value, an authority evaluation value, and an information-content evaluation value.
5. The method of claim 1, wherein determining repeated texts to be processed based on the extracted time features and performing deduplication processing on the repeated texts to be processed comprises:
determining similarity between time features extracted from different feature description data; and
in response to a judgment result that the similarity is smaller than a set similarity threshold, determining that the corresponding at least two texts to be processed are repeated texts to be processed, and performing deduplication processing on the repeated texts to be processed.
6. The method of claim 5, wherein determining similarity between time features extracted from different feature description data comprises: determining, by statistics and based on a set feature description period, the similarity between time features extracted from different feature description data within the same feature description period.
7. The method according to claim 6, wherein the text to be processed is a court announcement text, the time feature is the court time recorded in the court announcement text, and the similarity is the difference between the court times recorded in different court announcement texts.
8. The method according to any one of claims 1-7, further comprising:
for texts to be processed that are preliminarily judged to be non-repeated based on the extracted time features, extracting personnel features describing judicial events from the corresponding feature description data; and
determining, based on the extracted personnel features, the texts to be processed that are actually repeated among the non-repeated texts to be processed, and performing deduplication processing on them.
9. The method according to claim 8, wherein the determining, based on the extracted personnel features, of the actually repeated texts to be processed among the non-repeated texts to be processed and the performing of deduplication processing on them comprises:
determining, based on the extracted personnel features, the actually repeated texts to be processed among the non-repeated texts to be processed, and adding a second label to the actually repeated texts to be processed; and
performing deduplication processing on the actually repeated texts to be processed based on the added second label.
10. The method of claim 8, wherein the performing deduplication processing on the actually repeated texts to be processed based on the added second label comprises:
if the second labels of two texts to be processed are the same, determining that the two texts are actually repeated texts to be processed, and retaining the texts to be processed whose evaluation value is greater than a set evaluation threshold, wherein the evaluation value comprises at least one of a timeliness evaluation value, an authority evaluation value, and an information-content evaluation value.
11. The method of claim 8, wherein said extracting personnel features describing judicial events from the corresponding feature description data comprises: performing regular-expression matching on the feature description data corresponding to the non-repeated texts to be processed, based on a regular expression for extracting the personnel features, so as to extract the personnel features describing judicial events.
12. The method of any of claims 8-11, wherein the personnel feature is a party name.
13. The method according to any one of claims 1-12, wherein, before said determining a plurality of texts to be processed which are associated with the same judicial bulletin case number, the texts to be processed comprising feature description data of judicial events, the method comprises:
normalizing the obtained judicial bulletin case number so that its expression conforms to a normalized expression rule; and
traversing the judicial event bulletin library using the normalized judicial bulletin case number, so as to retrieve the plurality of texts to be processed which are associated with the same judicial bulletin case number.
14. A device for removing duplicate text, comprising:
a text acquisition unit, configured to determine a plurality of texts to be processed which are associated with the same judicial bulletin case number, wherein the texts to be processed comprise feature description data of judicial events;
a key feature extraction unit, configured to extract time features describing judicial events from the feature description data; and
a repeated text determining unit, configured to determine repeated texts to be processed based on the extracted time features and to perform deduplication processing on the repeated texts to be processed.
15. A computer storage medium having a computer-executable program stored thereon, the computer-executable program being executed to implement the method of any one of claims 1-13.
16. An electronic device, comprising a memory for storing a computer-executable program and a processor for executing the computer-executable program to perform the method of any one of claims 1-13.
CN202111342221.3A 2021-11-12 2021-11-12 Text duplicate removal method and device, computer storage medium and electronic equipment Pending CN114090798A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111342221.3A CN114090798A (en) 2021-11-12 2021-11-12 Text duplicate removal method and device, computer storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114090798A true CN114090798A (en) 2022-02-25

Family

ID=80300376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111342221.3A Pending CN114090798A (en) 2021-11-12 2021-11-12 Text duplicate removal method and device, computer storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114090798A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8682906B1 (en) * 2013-01-23 2014-03-25 Splunk Inc. Real time display of data field values based on manual editing of regular expressions
US20150180891A1 (en) * 2013-12-19 2015-06-25 Splunk Inc. Using network locations obtained from multiple threat lists to evaluate network data or machine data
CN108959305A (en) * 2017-05-22 2018-12-07 北京国信宏数科技有限公司 A kind of event extraction method and system based on internet big data
CN109684628A (en) * 2018-11-23 2019-04-26 武汉烽火众智数字技术有限责任公司 Case intelligently pushing method and system based on merit semantic analysis
CN112035653A (en) * 2020-11-05 2020-12-04 北京智源人工智能研究院 Policy key information extraction method and device, storage medium and electronic equipment
CN112733909A (en) * 2020-12-31 2021-04-30 北京软通智慧城市科技有限公司 Duplicate removal identification method, device, medium and electronic equipment for urban cases
US20210209112A1 (en) * 2020-04-27 2021-07-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Text query method and apparatus, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘晨;李兵;吴卫星;: "基于罪名相关成分标注的刑事裁判文书概要信息提取", 山东科技大学学报(自然科学版), no. 04, 25 June 2018 (2018-06-25) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230801

Address after: Room 404-405, 504, Building B-17-1, Big data Industrial Park, Kecheng Street, Yannan High tech Zone, Yancheng, Jiangsu Province, 224000

Applicant after: Yancheng Tianyanchawei Technology Co.,Ltd.

Address before: 224000 room 501-503, building b-17-1, Xuehai road big data Industrial Park, Kecheng street, Yannan high tech Zone, Yancheng City, Jiangsu Province (CNK)

Applicant before: Yancheng Jindi Technology Co.,Ltd.
