CN110909824A - Test data checking method and device, storage medium and electronic equipment

Publication number
CN110909824A
Authority
CN
China
Prior art keywords
test
clustering
cluster
sequence
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911252590.6A
Other languages
Chinese (zh)
Other versions
CN110909824B (en)
Inventor
***
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Xinkai Life Technology Co Ltd
Tianjin Happy Life Technology Co Ltd
Original Assignee
Tianjin Xinkai Life Technology Co Ltd
Tianjin Happy Life Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Xinkai Life Technology Co Ltd, Tianjin Happy Life Technology Co Ltd
Priority to CN201911252590.6A
Publication of CN110909824A
Application granted
Publication of CN110909824B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to the technical field of data processing, and in particular to a test data checking method, a test data checking device, a computer-readable storage medium and an electronic device. The method comprises the following steps: preprocessing test data of each test object in a target test to generate a sequence vector corresponding to each test object; performing clustering analysis on the sequence vectors to obtain a clustering result, the clustering result comprising N clusters, where N is a positive integer; and extracting target samples from the sequence vectors contained in the N clusters according to a preset sampling rule, so as to check the test data of the test objects corresponding to the target samples. The technical scheme of the embodiments of the disclosure can accurately represent the test sequence of a test object, and can reduce the workload of manual checking and improve checking efficiency while keeping checking accuracy controllable.

Description

Test data checking method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a test data checking method, a test data checking device, a computer-readable storage medium, and an electronic device.
Background
With the rapid development of technology, every field continues to innovate. In medicine, more and more new drugs are being developed. To ensure the safety and efficacy of a new drug, clinical trials are usually required first. Many factors can affect the quality of the test data collected during a clinical trial, so the test data must be checked before the new drug is evaluated on the basis of that data.
At present, when the test data of a clinical trial is checked, the missing values, abnormal values and time ranges in the test data are usually checked by the electronic data capture (EDC) system of the clinical trial, and the test sequence of each test object is then checked manually. However, manually checking the test sequence requires a great deal of labor and time, which makes checking inefficient and seriously slows down new drug evaluation.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide a test data checking method, a test data checking device, a computer-readable storage medium, and an electronic device, so as to overcome, at least to some extent, the problem of low checking efficiency when checking the test sequence of test data.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a method for checking test data, including:
preprocessing test data of each test object in a target test to generate a sequence vector corresponding to each test object;
performing clustering analysis on each sequence vector to obtain a clustering result; the clustering result comprises N clustering clusters, wherein N is a positive integer;
and extracting target samples from the sequence vectors contained in the N clustering clusters according to a preset sampling rule so as to check the test data of the test object corresponding to the target samples.
Optionally, based on the foregoing scheme, the preset sampling rule includes a preset sorting rule;
the extracting the target sample from the sequence vectors contained in the N clustering clusters according to the preset sampling rule comprises the following steps:
sequencing the N clustering clusters according to a preset sequencing rule to obtain a clustering cluster sequence;
and extracting the sequence vector in each cluster according to the sequence of each cluster in the cluster sequence to obtain a target sample.
Optionally, based on the foregoing scheme, the preset sampling rule further includes a sampling rate calculation method corresponding to the preset sorting rule;
the extracting the sequence vector in each cluster according to the sequence of each cluster in the cluster sequence to obtain a target sample comprises:
determining the sampling rate corresponding to each clustering cluster according to the sampling rate calculation method based on the sequence of each clustering cluster in the clustering cluster sequence;
and extracting the sequential vectors in the corresponding clustering clusters according to the sampling rate to obtain target samples.
Optionally, based on the foregoing scheme, the sample rate calculation method includes:
acquiring the ranking n of a target cluster in the cluster sequence; wherein n is a positive integer;
and calculating the ratio of n-1 to N, and determining the ratio as the sampling rate of the target cluster.
Optionally, based on the foregoing scheme, the preset sorting rule includes sorting according to similarity from large to small;
the sorting the N clustering clusters according to a preset sorting rule to obtain a clustering cluster sequence comprises:
taking the cluster with the largest number of sequential vectors in the N clusters as a reference cluster, and calculating the similarity between the remaining N-1 clusters and the reference cluster;
and sorting the remaining N-1 clustering clusters in descending order of similarity, with the reference cluster in first position, to obtain a clustering cluster sequence.
Optionally, based on the foregoing scheme, the preprocessing the test data of each test object in the target test to generate a sequence vector corresponding to each test object includes:
establishing one-to-one corresponding item vectors according to test items contained in the target test;
and converting the test data corresponding to each test object into a sequential vector according to the item vector so as to generate the sequential vector corresponding to each test object.
Optionally, based on the foregoing scheme, the establishing a one-to-one corresponding item vector according to the test items included in the target test includes:
acquiring a corresponding preset key item according to a test item contained in a target test;
and establishing item vectors corresponding to the preset key items one by one according to the quantity of the preset key items.
Optionally, based on the foregoing scheme, performing cluster analysis on each of the sequential vectors to obtain a clustering result includes:
and carrying out clustering analysis on each sequence vector according to a clustering algorithm based on a clustering core to obtain a clustering result.
According to a second aspect of the present disclosure, there is provided a device for checking test data, including:
the vector generation module is used for preprocessing the test data of each test object in the target test to generate a sequence vector corresponding to each test object;
the vector clustering module is used for carrying out clustering analysis on each sequential vector to obtain a clustering result; the clustering result comprises N clustering clusters, wherein N is a positive integer;
and the vector sampling module is used for extracting a target sample from the sequential vectors contained in the N clustering clusters according to a preset sampling rule so as to check the test data of the test object corresponding to the target sample.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of checking test data as described in any one of the above.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor; and
a storage device for storing one or more programs which, when executed by the processor, cause the processor to implement a method for checking test data as described in any one of the above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
According to the method for checking test data provided by the embodiments of the disclosure, the test data of each test object in the target test is preprocessed to generate a sequence vector corresponding to each test object; the sequence vectors are then subjected to clustering analysis to obtain a clustering result; and target samples are extracted from the sequence vectors contained in the N clusters according to a preset sampling rule, so as to check the test data of the test objects corresponding to the target samples. On the one hand, representing the test data of a test object by a sequence vector accurately captures the test sequence of that test object, which makes the test sequence convenient to check. On the other hand, extracting target samples from the clustering result according to the preset sampling rule and checking only those samples reduces the workload of manual checking and improves checking efficiency while keeping checking accuracy controllable.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 schematically illustrates a flow chart of a method for checking test data in an exemplary embodiment of the disclosure;
FIG. 2 schematically illustrates a flow chart of a method for preprocessing test data of each test object in a target test to generate a sequence vector corresponding to each test object in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a method for establishing one-to-one corresponding item vectors according to test items included in a target test in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a method for extracting target samples from the sequence vectors contained in the N clusters according to a preset sampling rule in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method for sorting the N clusters according to a preset sorting rule to obtain a cluster sequence in an exemplary embodiment of the disclosure;
FIG. 6 schematically illustrates a flow chart of a method for extracting the sequence vectors in each cluster according to the ranking of each cluster in the cluster sequence to obtain target samples in an exemplary embodiment of the disclosure;
FIG. 7 schematically illustrates a flow chart of a sampling rate calculation method in an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates the components of a device for checking test data according to an exemplary embodiment of the present disclosure;
FIG. 9 schematically illustrates the structure of a computer system of an electronic device suitable for implementing an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the present exemplary embodiment, a method for checking test data is provided first, which may be applied to checking the test sequence of test data. For example, in the medical field, the test sequence corresponding to the test data of a clinical trial of a new drug is checked. Referring to FIG. 1, the method for checking the test data may include the following steps:
S110, preprocessing test data of each test object in a target test to generate a sequence vector corresponding to each test object;
S120, performing clustering analysis on each sequence vector to obtain a clustering result; the clustering result comprises N clusters, wherein N is a positive integer;
S130, extracting target samples from the sequence vectors contained in the N clusters according to a preset sampling rule, so as to check the test data of the test objects corresponding to the target samples.
According to the method for checking test data provided in this exemplary embodiment, on the one hand, representing the test data of a test object by a sequence vector accurately captures the test sequence of that test object, which makes the test sequence convenient to check. On the other hand, extracting target samples from the clustering result according to the preset sampling rule and checking only those samples reduces the workload of manual checking and improves checking efficiency while keeping checking accuracy controllable.
Hereinafter, each step of the method for checking test data in the present exemplary embodiment will be described in more detail with reference to the drawings and the embodiments.
Step S110, preprocessing test data of each test object in the target test to generate a sequence vector corresponding to each test object.
In an example embodiment of the present disclosure, the target test refers to a test performed on at least one test object. The test process may include at least one test item arranged in sequence, and the test data may include a test object ID, a test item, and an item time. For example, a clinical trial of a new drug includes administering the drug to at least one test object every 7 days, performing a blood test every 10 days, and performing a basic physical condition test every 16 days. The corresponding test process includes administering the drug to the test object on day 1, administering the drug on day 8, performing a blood test on day 10, administering the drug on day 15, performing a basic physical condition test on day 16, and so on. The corresponding test data for test object A may then include the first medication time of A, the second medication time of A, the third medication time of A, the first blood test time of A, the first physical condition test time of A, and so on.
In addition, the test data can also comprise a test result corresponding to each test item, so that the test result can be directly called during checking and the target test can be evaluated according to the test result.
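As a minimal sketch of the test data described above, one record per performed test item could look as follows. The class name `TrialRecord`, the field names, the item labels and the use of a day number as the item time are all assumptions made for illustration; the disclosure does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrialRecord:
    """One row of test data: which test object did which test item, and when."""
    object_id: str                # test object ID, e.g. "A"
    item: str                     # test item name (hypothetical labels)
    day: int                      # item time, simplified here to a day number
    result: Optional[str] = None  # optional test result for the item


# Hypothetical records for test object A, following the example schedule.
records = [
    TrialRecord("A", "medication", 1),
    TrialRecord("A", "medication", 8),
    TrialRecord("A", "blood_test", 10),
    TrialRecord("A", "medication", 15),
    TrialRecord("A", "physical_exam", 16),
]

# The test sequence of a test object is its items ordered by item time.
sequence = [r.item for r in sorted(records, key=lambda r: r.day)]
print(sequence)
```

Keeping the result as an optional field reflects that the test result is only needed when a sampled record is actually checked.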
In an exemplary embodiment of the present disclosure, preprocessing test data of each test object in a target test to generate a sequence vector corresponding to each test object, as shown in FIG. 2, includes the following steps S210 to S220:
Step S210, establishing one-to-one corresponding item vectors according to the test items contained in the target test.
In an example embodiment of the present disclosure, the item vectors corresponding to the test items included in the target test may be established first. Specifically, referring to FIG. 3, establishing one-to-one corresponding item vectors according to the test items included in the target test includes the following steps S310 to S320:
Step S310, acquiring the corresponding preset key items according to the test items contained in the target test.
In an example embodiment of the present disclosure, in order to represent the overall test sequence of the target test in vector form, the preset key items in the target test must first be extracted. The preset key items comprise at least one item that must be performed in the test. It should be noted that, when the preset key items are obtained, test items of the same type that are performed at the same or similar times may be merged into a single preset key item. For example, a clinical trial of a new drug includes administering the drug to at least one test object every 7 days, examining white blood cell count, hemoglobin, neutrophil count and platelet count every 10 days, and examining basic physical condition every 16 days. Because the white blood cell count, hemoglobin, neutrophil count and platelet count examinations are all blood examination items performed at the same time, they can be merged into one blood examination item. The preset key items for the new drug may then include a medication item, a blood examination item, and a physical examination item.
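The merging described above can be sketched as a lookup from raw test items to preset key items. The mapping table `KEY_ITEM_OF` and the item labels are hypothetical; in practice the mapping would come from the trial protocol.

```python
# Hypothetical mapping from raw test items to preset key items.
KEY_ITEM_OF = {
    "medication": "medication",
    "white_blood_cell_count": "blood_test",
    "hemoglobin": "blood_test",
    "neutrophil_count": "blood_test",
    "platelet_count": "blood_test",
    "physical_exam": "physical_exam",
}


def preset_key_items(raw_items):
    """Return the key items in order of first appearance, merging raw
    items that map to the same key item."""
    seen = []
    for raw in raw_items:
        key = KEY_ITEM_OF[raw]
        if key not in seen:
            seen.append(key)
    return seen


raw = ["medication", "white_blood_cell_count", "hemoglobin",
       "neutrophil_count", "platelet_count", "physical_exam"]
key_items = preset_key_items(raw)
print(key_items)
```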
Step S320, establishing item vectors corresponding to the preset key items one to one according to the number of the preset key items.
In an example embodiment of the present disclosure, an item vector may be established for each preset key item, so that the overall sequence of the target test can be represented by item vectors. To avoid duplicate item vectors, the dimension of the item vectors can be determined from the number of preset key items, and the item vectors are then configured in one-to-one correspondence with the preset key items. For example, the item vectors may be configured as arrangements of binary 0s and 1s according to the order in which the preset key items first appear in the target test. Suppose the preset key items first appear in the target test in the order: medication item, blood examination item, physical examination item. The corresponding item vectors are then [0,0,1], [0,1,0] and [1,0,0], respectively. The item vectors may also be configured according to other rules, which are not particularly limited in this disclosure.
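The example configuration above amounts to a one-hot encoding whose dimension equals the number of preset key items. A minimal sketch follows; the function name `build_item_vectors` and the item labels are assumptions, and the position of the 1 is chosen only to reproduce the example vectors in the text.

```python
def build_item_vectors(key_items):
    """Map each preset key item to a one-hot item vector.

    The dimension equals the number of key items, so no two items share
    a vector. The i-th item (by first appearance) gets its 1 in the i-th
    position counted from the right, matching the example in the text.
    """
    dim = len(key_items)
    vectors = {}
    for i, item in enumerate(key_items):
        vec = [0] * dim
        vec[dim - 1 - i] = 1
        vectors[item] = vec
    return vectors


item_vectors = build_item_vectors(["medication", "blood_test", "physical_exam"])
print(item_vectors)
```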
Step S220, converting the test data corresponding to each test object into a sequential vector according to the item vector, so as to generate a sequential vector corresponding to each test object.
In an example embodiment of the present disclosure, the sequence vector corresponding to a test object may be formed by converting the test data of that test object into item vectors arranged in chronological order. For example, continuing the above embodiment, assume that test object B takes part in the clinical trial at the standard times, so that the test data of test object B includes medication on day 1, medication on day 8, a blood test on day 10, medication on day 15, a basic physical condition test on day 16, and so on. The corresponding sequence vector is: [[0,0,1], [0,0,1], [0,1,0], [0,0,1], [1,0,0], ...].
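The conversion for test object B can be sketched as sorting the events by item time and replacing each key item with its item vector. The event format (day, key item) and the names below are assumptions for illustration.

```python
# Item vectors from the example in the text (labels are hypothetical).
ITEM_VECTORS = {
    "medication": [0, 0, 1],
    "blood_test": [0, 1, 0],
    "physical_exam": [1, 0, 0],
}


def to_sequence_vector(events):
    """Convert one test object's events, given as (day, key_item) pairs
    in any order, into a sequence vector ordered by item time."""
    ordered = sorted(events, key=lambda e: e[0])
    # Copy each item vector so sequence vectors do not share list objects.
    return [list(ITEM_VECTORS[item]) for _, item in ordered]


# Events of test object B, deliberately unsorted.
events_b = [(8, "medication"), (1, "medication"), (16, "physical_exam"),
            (10, "blood_test"), (15, "medication")]
sequence_b = to_sequence_vector(events_b)
print(sequence_b)
```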
Since the missing values, abnormal values, time ranges and the like in the test data can be checked by the EDC system, the test data checked by the EDC are all complete test data. At this moment, the sequence of the preset key items of a certain test object in the test process can be accurately reflected by the item vectors according to the corresponding arrangement mode of the test data, so that the test sequence of the test object can be conveniently checked.
Step S120, performing clustering analysis on each sequence vector to obtain a clustering result.
In an exemplary embodiment of the present disclosure, because test objects may skip a test item or take part in a test item incorrectly during the target test, the test data may contain a variety of test sequences. Correspondingly, there are many different sequence vectors, and the final clustering result may include N clusters, where N is a positive integer.
In an example embodiment of the present disclosure, performing clustering analysis on each sequence vector to obtain a clustering result may include: performing clustering analysis on each sequence vector according to a clustering algorithm based on clustering cores to obtain the clustering result.
In an example embodiment of the present disclosure, the sequence vectors corresponding to all test objects of the target test may be subjected to clustering analysis according to a clustering algorithm based on clustering cores, so that the sequence vectors corresponding to test data with the same test sequence are grouped into the same cluster. It should be noted that the sequence vectors may also be clustered according to other clustering algorithms; the type of clustering algorithm is not particularly limited in the present disclosure.
The number of possible wrong test sequences multiplies when the target test contains more test items or when test items are repeated more often. By performing clustering analysis on the sequence vectors, the sequence vectors can be grouped appropriately without defining the wrong sequences in advance, which avoids the time needed to define every wrong test sequence and improves the efficiency of checking the test sequence.
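The disclosure does not name a specific algorithm. As one possible reading of "a clustering algorithm based on a clustering core", the sketch below uses plain k-means (Lloyd's algorithm), treating each cluster center as its core. Flattening the sequence vectors, padding shorter ones with zeros, and picking the first k distinct points as initial centers are all assumptions made to keep the example small and deterministic.

```python
def flatten(seq_vec, length):
    """Flatten a sequence vector and pad it with zeros to a fixed length."""
    flat = [x for item in seq_vec for x in item]
    return flat + [0] * (length - len(flat))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means. Returns (labels, centers)."""
    # Deterministic initialization: the first k distinct points.
    centers = []
    for p in points:
        if p not in centers:
            centers.append(list(p))
        if len(centers) == k:
            break
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


M, B, P = [0, 0, 1], [0, 1, 0], [1, 0, 0]  # medication, blood test, physical exam
correct = [M, M, B, M, P]   # the standard test sequence
swapped = [M, B, M, M, P]   # blood test taken before the second dose
missing = [M, M, M, P]      # blood test skipped
sequences = [correct, correct, swapped, correct, missing]

length = max(len(s) for s in sequences) * 3
points = [flatten(s, length) for s in sequences]
labels, centers = kmeans(points, k=3)
print(labels)
```

Here the three objects with the standard sequence share one cluster, while each deviating sequence forms its own cluster, without any wrong sequence having been defined beforehand.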
Step S130, extracting target samples from the sequence vectors contained in the N clustering clusters according to a preset sampling rule, so as to check the test data of the test object corresponding to the target samples.
In an example embodiment of the present disclosure, after the target sample is obtained, since the target sample is a sequential vector, test data of a corresponding test object may be extracted according to the sequential vector to check the test sequence, so as to determine whether the test sequence is a wrong test sequence according to the test data.
In an example embodiment of the present disclosure, the preset sampling rule may include a preset sorting rule. The preset sorting rule may include a rule for sorting the clusters according to an attribute of the clusters. In general, the attribute may be one related to the error rate of the sequence vectors in a cluster. For example, the clusters may be sorted directly in ascending order of error rate, or another attribute related to the error rate may be used as the sorting basis, which is not particularly limited in this disclosure.
In an exemplary embodiment of the present disclosure, referring to FIG. 4, extracting target samples from the sequence vectors contained in the N clusters according to a preset sampling rule includes the following steps S410 to S420:
Step S410, sorting the N clusters according to a preset sorting rule to obtain a cluster sequence.
In an example embodiment of the present disclosure, since the test order of most test data is correct when the target test is performed in a normal condition, the order vectors corresponding to the test data with the correct test order will be clustered into the same cluster. In this case, this cluster can be used as a reference cluster, and the more similar the other clusters are to the reference cluster, the smaller the probability that the test sequence will be wrong. Therefore, the preset sort rule may be set to sort from large to small according to the similarity. In addition, because the relation between the clustering result of the sequence vector corresponding to different test data and the error rate may be different, a preset ordering rule may be set according to the specific situation of the test data to order the clustering clusters, which is not particularly limited by the present disclosure.
In an example embodiment of the present disclosure, when the preset sorting rule includes sorting in descending order of similarity, referring to FIG. 5, sorting the N clusters according to the preset sorting rule to obtain a cluster sequence includes the following steps S510 to S520:
Step S510, taking the cluster with the largest number of sequence vectors among the N clusters as a reference cluster, and calculating the similarity between each of the remaining N-1 clusters and the reference cluster.
Step S520, sorting the remaining N-1 clusters in descending order of similarity, with the reference cluster in first position, to obtain a cluster sequence.
In an example embodiment of the present disclosure, when clustering analysis is performed using a clustering algorithm based on clustering cores, the similarity between clusters can be characterized by the distance between their clustering cores, so the distance between the core of the reference cluster and the core of each remaining cluster may be calculated as the measure of similarity, a smaller distance indicating a greater similarity. The remaining clusters are then sorted in descending order of similarity, with the reference cluster in first position. It should be noted that the similarity may also be characterized in other ways, which the disclosure does not particularly limit. For example, in general, the more test data share the same test sequence, the smaller the probability of error, so the number of sequence vectors in each remaining cluster may be used as the similarity.
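Steps S510 and S520 can be sketched as follows, using the Euclidean distance between cluster centers as an inverse measure of similarity (smaller distance, greater similarity). The dictionary layout, the cluster ids and the choice of Euclidean distance are assumptions for illustration.

```python
import math


def order_clusters(clusters):
    """Order cluster ids: the reference cluster first, then the rest by
    increasing center distance to it (that is, decreasing similarity).

    clusters: dict mapping cluster id -> {"center": [...], "size": int}
    """
    # S510: the cluster with the most sequence vectors is the reference.
    reference = max(clusters, key=lambda cid: clusters[cid]["size"])
    ref_center = clusters[reference]["center"]
    # S520: sort the remaining clusters by distance to the reference center.
    others = [cid for cid in clusters if cid != reference]
    others.sort(key=lambda cid: math.dist(ref_center, clusters[cid]["center"]))
    return [reference] + others


# Hypothetical clustering result with two-dimensional centers.
clusters = {
    "c0": {"center": [0.0, 0.0], "size": 40},
    "c1": {"center": [3.0, 4.0], "size": 5},
    "c2": {"center": [1.0, 0.0], "size": 8},
}
cluster_sequence = order_clusters(clusters)
print(cluster_sequence)
```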
Step S420, extracting the sequence vector in each cluster according to the sequence of each cluster in the cluster sequence to obtain a target sample.
In an example embodiment of the present disclosure, since the preset ordering rules are different, the corresponding obtained cluster sequences are also different. Therefore, different sampling modes can be configured according to the preset sorting rule, and the preset sorting rule and the configuration of the corresponding sampling mode are not limited by the disclosure.
In an example embodiment of the present disclosure, the sampling check may be performed on different cluster clusters by configuring different sampling rates. For example, a higher sampling rate can be configured for a cluster with a higher error rate to extract more target samples for checking, so that the checking accuracy is ensured; otherwise, a lower sampling rate can be set for the clustering cluster with a smaller error rate, so that the number of samples is reduced, and the checking efficiency is improved.
In an exemplary embodiment of the present disclosure, when the preset sampling rule further includes a sampling rate calculation method corresponding to the preset ordering rule, referring to fig. 6, extracting the sequence vectors in each cluster according to the position of each cluster in the cluster sequence to obtain a target sample includes the following steps S610 to S620:
Step S610, based on the position of each cluster in the cluster sequence, determining the sampling rate corresponding to each cluster according to the sampling rate calculation method.
In an example embodiment of the present disclosure, because different preset ordering rules yield different cluster sequences, the sampling rate calculation methods suited to those cluster sequences also differ. To match the sampling rate more closely to each cluster, a sampling rate calculation rule can be set according to the preset ordering rule. For example, when the preset ordering rule sorts the clusters by error rate from large to small, a decreasing sampling rate calculation method may be set, because clusters earlier in the cluster sequence have larger error rates. By configuring the sampling rate calculation method according to the preset ordering rule, an appropriate sampling rate can be configured for each cluster, and an appropriate number of target samples can then be extracted for checking, which helps ensure the checking accuracy.
In an example embodiment of the present disclosure, when the preset ordering rule includes ordering by similarity from large to small, the corresponding sampling rate calculation method, as shown in fig. 7, includes the following steps S710 to S720:
Step S710, obtaining the rank n of the target cluster in the cluster sequence.
Step S720, calculating the ratio of n-1 to N, and configuring the ratio as the sampling rate of the target cluster.
In an example embodiment of the disclosure, by obtaining the rank n of a target cluster in the cluster sequence and using the ratio of n-1 to N as the sampling rate of that cluster, the sampling rate can be calculated from the cluster's rank while satisfying the condition that the larger the rank, the larger the sampling rate.
It should be noted that, because the preset sampling rules applicable to different test data differ, the preset sorting rules they include and the corresponding sampling rate calculation methods also differ; the present disclosure therefore does not particularly limit the preset sorting rules or the corresponding sampling rate calculation methods. For example, when the preset sorting rule sorts by similarity from small to large, the reference cluster occupies the last position in the resulting cluster sequence, and the other clusters are arranged in ascending order of similarity. The sampling rate calculation method described above is then not applicable. Instead, the ratio of N-n+1 to N may be used as the sampling rate, or the sampling rate of each cluster may be calculated by another method in which the sampling rate increases as the similarity to the reference cluster decreases.
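As a minimal sketch of the two rank-based rules, the function below reads the first rule as the ratio of n-1 to N and the second as the ratio of N-n+1 to N, where n is the rank of the target cluster and N is the number of clusters. The function name `sampling_rate` and the flag `descending_similarity` are hypothetical and introduced only for illustration.

```python
def sampling_rate(n, N, descending_similarity=True):
    """Sampling rate of the cluster ranked n (1-based) among N clusters."""
    if not 1 <= n <= N:
        raise ValueError("rank n must satisfy 1 <= n <= N")
    if descending_similarity:
        # Reference cluster first: the rate grows with the rank.
        return (n - 1) / N
    # Reference cluster last: the rate shrinks with the rank.
    return (N - n + 1) / N
```

Under the first rule the reference cluster (n = 1) receives a rate of 0, so a practical configuration might add a floor to the rate; that would be a design choice outside what the text states.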
Step S620, extracting sequence vectors from the corresponding cluster according to the sampling rate to obtain a target sample.
In an example embodiment of the present disclosure, because different clusters can be sampled at different sampling rates, the number of sequence vectors extracted from each cluster also differs. Specifically, more target samples are extracted from clusters with a high error rate, and fewer from clusters with a low error rate.
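The per-cluster extraction can be sketched as below. The sketch assumes that each cluster is a list of sequence vectors, that the rates have already been determined per cluster, and that the sample count is rounded up; the function name `draw_target_samples` and the fixed random seed are hypothetical choices, not part of the disclosure.

```python
import math
import random


def draw_target_samples(cluster_sequence, rates, seed=0):
    """Draw sequence vectors from each cluster at that cluster's rate.

    `cluster_sequence` is a list of clusters (each a list of vectors) in
    sorted order; `rates` holds one sampling rate per cluster.
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    samples = []
    for vectors, rate in zip(cluster_sequence, rates):
        # Assumption: round the sample count up, capped at the cluster size.
        k = min(len(vectors), math.ceil(rate * len(vectors)))
        samples.extend(rng.sample(vectors, k))  # sampling without replacement
    return samples
```

The test objects corresponding to the returned vectors would then be the ones whose test data are checked.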
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
In addition, in an exemplary embodiment of the present disclosure, a device for checking test data is also provided. Referring to fig. 8, the apparatus 800 for checking test data includes: a vector generation module 810, a vector clustering module 820, and a vector sampling module 830.
The vector generating module 810 may be configured to pre-process test data of each test object in a target test to generate a sequence vector corresponding to each test object; the vector clustering module 820 can be configured to perform cluster analysis on each sequential vector to obtain a clustering result; the clustering result comprises N clustering clusters, wherein N is a positive integer; the vector sampling module 830 may be configured to extract a target sample from sequential vectors included in the N clustering clusters according to a preset sampling rule, so as to check the test data of the test object corresponding to the target sample.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the vector sampling module 830 may be configured to sort the N clustering clusters according to a preset sorting rule to obtain a clustering cluster sequence; and extracting the sequence vector in each cluster according to the sequence of each cluster in the cluster sequence to obtain a target sample.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the vector sampling module 830 may be configured to determine, based on the position of each cluster in the cluster sequence, a sampling rate corresponding to each cluster according to the sampling rate calculation method; and extract the sequence vectors in the corresponding clusters according to the sampling rate to obtain target samples.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the vector sampling module 830 may be configured to obtain the rank n of a target cluster in the cluster sequence, where n is a positive integer; and calculate the ratio of n-1 to N and determine the ratio as the sampling rate of the target cluster.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the vector sampling module 830 may be configured to take the cluster containing the largest number of sequence vectors among the N clusters as a reference cluster, and calculate the similarity between each of the remaining N-1 clusters and the reference cluster; and, taking the reference cluster as the first position, sort the remaining N-1 clusters in descending order of similarity to obtain a cluster sequence.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the vector generation module 810 may be configured to establish a one-to-one corresponding item vector according to a test item included in a target test; and converting the test data corresponding to each test object into a sequential vector according to the item vector so as to generate the sequential vector corresponding to each test object.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the vector generation module 810 may be configured to obtain a corresponding preset key item according to a test item included in a target test; and establishing item vectors corresponding to the preset key items one by one according to the quantity of the preset key items.
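One possible reading of the vector generation performed by module 810 is sketched below. It assumes that the item vector reserves one position per preset key item and that the sequence vector records, at each position, the order in which that key item was carried out for the test object (0 if the item is absent). This encoding, the item names in the usage note, and the function names `build_item_index` and `to_sequence_vector` are all assumptions made for illustration.

```python
def build_item_index(key_items):
    """Map each preset key item to one fixed position of the item vector."""
    return {item: pos for pos, item in enumerate(key_items)}


def to_sequence_vector(performed_items, item_index):
    """Encode the order in which a test object's key items were carried out.

    `performed_items` lists the object's test items in chronological order.
    """
    vector = [0] * len(item_index)
    order = 0
    for item in performed_items:
        pos = item_index.get(item)
        if pos is None or vector[pos]:
            continue  # skip non-key items and repeated occurrences
        order += 1
        vector[pos] = order
    return vector
```

For example, with the hypothetical key items `["blood", "urine", "ecg"]`, a test object whose items were carried out as `["ecg", "weight", "blood"]` would be encoded as `[2, 0, 1]`.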
In an exemplary embodiment of the present disclosure, based on the foregoing solution, the vector clustering module 820 may be configured to perform cluster analysis on each sequential vector according to a clustering algorithm based on a clustering core to obtain a clustering result.
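The disclosure does not name a specific core-based clustering algorithm here. As a stand-in, the sketch below uses a plain k-means style loop, in which each sequence vector is assigned to its nearest clustering core and each core is then moved to the mean of its members. The function name `cluster_by_cores`, the caller-supplied initial cores, and the fixed iteration count are hypothetical.

```python
import math


def cluster_by_cores(vectors, cores, iterations=10):
    """Group vectors around clustering cores with a k-means style loop.

    `iterations` must be at least 1. Returns the final cores and the
    clusters from the last assignment pass.
    """
    cores = [list(core) for core in cores]
    for _ in range(iterations):
        # Assignment pass: each vector joins the cluster of its nearest core.
        clusters = [[] for _ in cores]
        for v in vectors:
            nearest = min(range(len(cores)),
                          key=lambda i: math.dist(v, cores[i]))
            clusters[nearest].append(v)
        # Update pass: move each non-empty cluster's core to its mean.
        for i, members in enumerate(clusters):
            if members:
                cores[i] = [sum(col) / len(members) for col in zip(*members)]
    return cores, clusters
```

The number of cores passed in fixes N, the number of clusters in the clustering result.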
For details not disclosed in the apparatus embodiments of the present disclosure, please refer to the embodiments of the method for checking test data of the present disclosure.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the method for checking test data is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 900 according to such an embodiment of the disclosure is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
The storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary method" section of this specification. For example, the processing unit 910 may execute the steps shown in fig. 1: step S110, preprocessing test data of each test object in a target test to generate a sequence vector corresponding to each test object; step S120, performing cluster analysis on each sequence vector to obtain a clustering result, the clustering result comprising N clusters, where N is a positive integer; and step S130, extracting target samples from the sequence vectors contained in the N clusters according to a preset sampling rule, so as to check the test data of the test objects corresponding to the target samples.
As another example, the electronic device may implement the steps shown in fig. 2 to 7.
The storage unit 920 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 921 and/or a cache memory unit 922, and may further include a read only memory unit (ROM) 923.
Storage unit 920 may also include a program/utility 924 having a set (at least one) of program modules 925, such program modules 925 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 970 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Furthermore, an exemplary embodiment of the present disclosure provides a program product for implementing the above method, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (11)

1. A method for checking test data is characterized by comprising the following steps:
preprocessing test data of each test object in a target test to generate a sequence vector corresponding to each test object;
performing clustering analysis on each sequence vector to obtain a clustering result; the clustering result comprises N clustering clusters, wherein N is a positive integer;
and extracting target samples from the sequence vectors contained in the N clustering clusters according to a preset sampling rule so as to check the test data of the test object corresponding to the target samples.
2. The method of claim 1, wherein the predetermined sampling rule comprises a predetermined ordering rule;
the extracting the target sample from the sequence vectors contained in the N clustering clusters according to the preset sampling rule comprises the following steps:
sequencing the N clustering clusters according to a preset sequencing rule to obtain a clustering cluster sequence;
and extracting the sequence vector in each cluster according to the sequence of each cluster in the cluster sequence to obtain a target sample.
3. The method according to claim 2, wherein the preset sampling rule further comprises a sampling rate calculation method corresponding to the preset sorting rule;
the extracting the sequence vector in each cluster according to the sequence of each cluster in the cluster sequence to obtain a target sample comprises:
determining the sampling rate corresponding to each clustering cluster according to the sampling rate calculation method based on the sequence of each clustering cluster in the clustering cluster sequence;
and extracting the sequential vectors in the corresponding clustering clusters according to the sampling rate to obtain target samples.
4. The method according to claim 3, wherein the sampling rate calculation method comprises:
acquiring the rank n of a target cluster in the cluster sequence; wherein n is a positive integer;
and calculating the ratio of n-1 to N and determining the ratio as the sampling rate of the target cluster.
5. The method of claim 2, wherein the preset ordering rules include ordering from big to small according to similarity;
the sorting the N clustering clusters according to a preset sorting rule to obtain a clustering cluster sequence comprises:
taking the cluster with the largest number of sequential vectors in the N clusters as a reference cluster, and calculating the similarity between the remaining N-1 clusters and the reference cluster;
and, taking the reference cluster as the first position, sorting the remaining N-1 clusters in descending order of similarity to obtain a cluster sequence.
6. The method of claim 1, wherein preprocessing the test data of each test object in the target test to generate a sequential vector corresponding to each test object comprises:
establishing one-to-one corresponding item vectors according to test items contained in the target test;
and converting the test data corresponding to each test object into a sequential vector according to the item vector so as to generate the sequential vector corresponding to each test object.
7. The method of claim 6, wherein the establishing a one-to-one correspondence item vector according to the test items included in the target test comprises:
acquiring a corresponding preset key item according to a test item contained in a target test;
and establishing item vectors corresponding to the preset key items one by one according to the quantity of the preset key items.
8. The method of claim 1, wherein performing cluster analysis on each of the sequential vectors to obtain a cluster result comprises:
and carrying out clustering analysis on each sequence vector according to a clustering algorithm based on a clustering core to obtain a clustering result.
9. An apparatus for checking test data, comprising:
the vector generation module is used for preprocessing the test data of each test object in the target test to generate a sequence vector corresponding to each test object;
the vector clustering module is used for carrying out clustering analysis on each sequential vector to obtain a clustering result; the clustering result comprises N clustering clusters, wherein N is a positive integer;
and the vector sampling module is used for extracting a target sample from the sequential vectors contained in the N clustering clusters according to a preset sampling rule so as to check the test data of the test object corresponding to the target sample.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of checking test data according to any one of claims 1 to 8.
11. An electronic device, comprising:
one or more processors; and
a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of checking test data according to any one of claims 1 to 8.
CN201911252590.6A 2019-12-09 2019-12-09 Test data checking method and device, storage medium and electronic equipment Active CN110909824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911252590.6A CN110909824B (en) 2019-12-09 2019-12-09 Test data checking method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911252590.6A CN110909824B (en) 2019-12-09 2019-12-09 Test data checking method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110909824A true CN110909824A (en) 2020-03-24
CN110909824B CN110909824B (en) 2022-10-28

Family

ID=69823644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911252590.6A Active CN110909824B (en) 2019-12-09 2019-12-09 Test data checking method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110909824B (en)

Also Published As

Publication number Publication date
CN110909824B (en) 2022-10-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant