CN111079940B - Decision tree model establishing method and using method for real-time fake-licensed car analysis - Google Patents

Publication number
CN111079940B
Authority
CN
China
Prior art keywords
data
time
vehicle
fake
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911196978.9A
Other languages
Chinese (zh)
Other versions
CN111079940A (en
Inventor
杨光
贺珊
张龙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Fiberhome Digtal Technology Co Ltd
Original Assignee
Wuhan Fiberhome Digtal Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Fiberhome Digtal Technology Co Ltd filed Critical Wuhan Fiberhome Digtal Technology Co Ltd
Priority to CN201911196978.9A priority Critical patent/CN111079940B/en
Publication of CN111079940A publication Critical patent/CN111079940A/en
Application granted granted Critical
Publication of CN111079940B publication Critical patent/CN111079940B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Educational Administration (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a decision tree model establishing method for real-time fake-licensed vehicle analysis, applied to the technical field of establishing real-time fake-licensed vehicle analysis models, the method comprising: S11, preparing a training data set and a verification data set; and S12, constructing a decision tree model. The invention also provides a method of using the decision tree model for real-time fake-licensed vehicle analysis. By applying the embodiments of the invention, real-time vehicle passing data is analyzed through the established decision tree model, and fake-licensed vehicle data meeting the conditions is pushed as an alarm, so that real-time analysis of fake-licensed vehicles is realized.

Description

Decision tree model establishing method and using method for real-time fake-licensed car analysis
Technical Field
The invention relates to the technical field of vehicle fake-licensed analysis models, in particular to a decision tree model establishing method and a use method for real-time fake-licensed vehicle analysis.
Background
A fake-licensed vehicle is a vehicle on which a forged license plate bearing the same number as a genuine license plate has been mounted, so that an illegal vehicle is given a legal appearance. Fake-licensed vehicles are illegal, are difficult for traffic police to recognize while they are being driven, and can only be identified automatically by technical means.
Currently, most fake-licensed vehicle analysis compares the appearance time, appearance place, body color, license plate color, vehicle style and similar attributes of vehicles bearing the same license plate number, and sometimes even depends on vehicle registration information held by the vehicle management department. In practice, however, when a vehicle is driving, the snapshot devices installed at each point capture only its license plate number, travel time and travel place (the location of each snapshot device is fixed, so when a vehicle is captured within a device's monitoring range, its approximate longitude and latitude can be obtained), while the vehicle information corresponding to the license plate number (such as vehicle brand and appearance parameters) cannot be obtained. The existing analysis process is therefore often limited by its comparison data sources and cannot be performed in real time. In addition, the space-time point location model adopted for fake-licensed vehicle analysis cannot eliminate false alarms caused by a vehicle turning around between closely spaced device points. These problems increase the difficulty of real-time fake-licensed vehicle analysis and indirectly reduce its accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a decision tree model establishing method and a using method for real-time fake-licensed vehicle analysis. The aims are to analyze real-time vehicle passing data through the established decision tree model and to push alarms for fake-licensed vehicle data meeting the conditions, thereby realizing real-time analysis of fake-licensed vehicles; and, by comparing real-time data across multiple times and places, to filter out vehicle turn-around situations, so that the adopted decision tree model reduces the probability of false fake-licensed vehicle alarms and improves the accuracy of the analysis.
The invention is realized by the following steps:
An embodiment of the invention discloses a decision tree model building method for real-time fake-licensed vehicle analysis, comprising the following steps:
S11, preparing a training data set and a verification data set;
obtaining fake-licensed vehicle data appearing in a historical database; obtaining, according to the time at which each fake-licensed vehicle appeared, non-fake-licensed vehicle data within a time range corresponding to that time; and obtaining, based on the fake-licensed vehicle data and the non-fake-licensed vehicle data, first five-dimensional vector data associating each fake-licensed vehicle with its corresponding real vehicle, wherein the first five-dimensional vector data corresponding to any license plate number comprises: license plate number, fake-licensed vehicle appearance time, fake-licensed vehicle appearance place, real vehicle appearance time and real vehicle appearance place; and acquiring real vehicle data appearing in the historical database, and acquiring, according to the appearance time and place of each real vehicle, second five-dimensional vector data consisting of a plurality of real vehicle data, wherein the second five-dimensional vector data comprises: license plate number, first time at which the real vehicle appears, first place at which the real vehicle appears, second time at which the real vehicle appears, and second place at which the real vehicle appears;
obtaining three-dimensional vector data corresponding to each license plate number based on the first five-dimensional vector data and the second five-dimensional vector data, wherein the three-dimensional vector data comprises: license plate number, time difference of vehicle appearance, and distance of vehicle appearance;
taking the three-dimensional vector data as a sample of a training data set and a sample of a testing data set;
S12, constructing a decision tree model;
according to each license plate number in the three-dimensional data of the training data set, respectively taking the time difference of the real vehicle and the fake-licensed vehicle and the distance of the real vehicle and the fake-licensed vehicle as characteristics, and calculating the corresponding information gain;
according to the information gain corresponding to each license plate number, a root node and a leaf node are constructed to form a preliminary decision tree model;
and verifying and pruning the preliminary decision tree model according to the training data set to obtain the decision tree model.
In one implementation, the step of obtaining three-dimensional vector data corresponding to each license plate number based on the first five-dimensional vector data and the second five-dimensional vector data includes:
obtaining, based on the first five-dimensional vector data, first three-dimensional data for the license plate number, wherein the first three-dimensional data comprises: license plate number, time difference between the real vehicle and the fake-licensed vehicle, and distance between the real vehicle and the fake-licensed vehicle; and obtaining, based on the second five-dimensional vector data, second three-dimensional data for the license plate number, wherein the second three-dimensional data comprises: license plate number, time difference of the real vehicle, and distance of the real vehicle;
combining the first three-dimensional data and the second three-dimensional data into three-dimensional vector data.
In one implementation, the formula used to calculate the information gain g(X, A) is expressed as:
g(X,A)=H(X)-H(X|A)
wherein,
Figure GDA0004016254630000031
H(X) is the entropy of the random variable, H(X|A) is the conditional entropy given feature A, n is the number of values of feature A, and p_i is the probability of the i-th sample value in the set; D represents the sample set of the respective feature X, D_i represents one of the K partitions within feature X_i, i.e. D_i represents the sample set of feature X_i, and D_ik represents the set of samples in feature X_i that belong to partition k.
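The formula image itself is not reproduced in this text. Assuming the standard definitions used by the ID3 and C4.5 algorithms, and using only the symbols defined above, the entropy and the conditional entropy would take the following form. This is a reconstruction under that assumption, not a copy of the original figure; the base-2 logarithm in particular is an assumption.

```latex
H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

H(X \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\, H(D_i)
            = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}
```

Under these definitions the information gain g(X, A) = H(X) - H(X|A) is never negative, and it is largest for the feature whose partitions are closest to being pure.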
In one implementation, the step of constructing a root node and a leaf node according to an information gain corresponding to each license plate number to form a preliminary decision tree model includes:
selecting the characteristic with the maximum information gain as a root node and the other characteristics as leaf nodes according to the information gain corresponding to each license plate number;
acquiring a root node and a leaf node corresponding to each feature;
and constructing a preliminary decision tree model based on the acquired root nodes and leaf nodes.
In one implementation, the step of verifying and pruning the preliminary decision tree model according to the training data set to obtain the decision tree model includes:
verifying the preliminary decision tree model through a training data set;
and pruning according to the verification result and a preset formula to obtain the decision tree model.
In one implementation, the preset formula is specifically expressed as:
Figure GDA0004016254630000032
wherein A_p and A_q respectively represent the p partition and the q partition of feature A, S represents the test data set, and Model represents the decision tree model; if the ratio of the accuracy of the model after pruning, Model(A_p, S), to the accuracy of the model before pruning, Model(A_q, S), is greater than or equal to 1, the partition after pruning is effective.
In one implementation, the three-dimensional vector data is embodied as:
Figure GDA0004016254630000041
where Δt = |t_i - t_j|, i, j ∈ [1, n]
wherein p represents the license plate number, m1 and m2 respectively represent the unique identifiers of the two sample records, t_i is the time at which license plate number p appears at m1, t_j is the time at which license plate number p appears at m2, Δt represents the time difference between the two sample records, and Δd represents the distance corresponding to the two appearances of license plate number p,
Figure GDA0004016254630000042
where EARTH_RADIUS represents the radius of the Earth, lat_i and lng_i are the latitude and longitude of the snapshot device corresponding to time t_i, and lat_j and lng_j are the latitude and longitude of the snapshot device corresponding to time t_j.
In addition, the invention also discloses a decision tree model using method for real-time fake-licensed car analysis, which comprises the following steps:
selecting the maximum division value of the time difference feature in the decision tree model as the length of the time window for acquiring real-time streaming data, acquiring the real-time streaming data by consuming Kafka with Spark Streaming, and dividing the real-time streaming data into RDD data sets whose length is the maximum division value;
aggregating each RDD data set by license plate number, filtering out data with the same place, calculating the time difference between the passing records of the same license plate, and importing the time differences and the snapshot device point location information of each record into the decision tree model as source data to obtain the analysis result of the decision tree model, wherein the analysis result comprises the license plate information meeting the fake-licensed vehicle conditions and the corresponding passing records.
When the decision tree model is used, the maximum division value of the time difference feature is first selected as the length of the time window for acquiring real-time streaming data, and the streaming data consumed in real time is divided into RDD data sets whose length is the maximum division value. Each RDD data set is aggregated by license plate number, data with the same place is filtered out, and the time difference between the passing records of the same license plate is calculated. The time differences and the snapshot device point location information of each record are imported into the decision tree model as source data, the places corresponding to the samples and the distance corresponding to any two appearances of the license plate number are obtained, and the analysis is performed based on time and distance. In the embodiment of the invention, training data containing vehicle turn-around situations is used as non-fake-licensed vehicle data when the decision tree model is constructed, which reduces the cases in which a vehicle is falsely reported as fake-licensed because it turned around, and improves the accuracy of the analysis.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a decision tree model building method for real-time fake-licensed vehicle analysis according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application flow of a method for using a decision tree model for real-time fake-licensed car analysis according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another application of the decision tree model using method for real-time fake-licensed vehicle analysis according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Referring to FIG. 1, an embodiment of the present invention provides a decision tree model building method for real-time fake-licensed vehicle analysis, where the method includes:
S11, preparing a training data set and a verification data set.
It can be understood that the training data set and the verification data set are samples obtained by integrating historical data. The specific implementation steps are as follows. First, fake-licensed vehicle data appearing in a historical database is obtained; according to the time at which each fake-licensed vehicle appeared, non-fake-licensed vehicle data within a time range corresponding to that time is obtained; and, based on the fake-licensed vehicle data and the non-fake-licensed vehicle data, first five-dimensional vector data associating each fake-licensed vehicle with its corresponding real vehicle is obtained, wherein the first five-dimensional vector data corresponding to any license plate number comprises: license plate number, fake-licensed vehicle appearance time, fake-licensed vehicle appearance place, real vehicle appearance time and real vehicle appearance place. Then, real vehicle data appearing in the historical database is acquired, and second five-dimensional vector data consisting of a plurality of real vehicle data is acquired according to the appearance time and place of each real vehicle, wherein the second five-dimensional vector data comprises: license plate number, first time at which the real vehicle appears, first place at which the real vehicle appears, second time at which the real vehicle appears, and second place at which the real vehicle appears.
In a specific implementation, historical fake-licensed vehicle records can be randomly selected as the fake-licensed vehicle data, and real license plate data within the same time range is selected as the non-fake-licensed vehicle data (for example, if the fake-licensed vehicle appears at time t1, a time range from t1 - t2 to t1 + t3 is set).
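The time-range selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (`plate`, `time`, `device`), the function name `records_in_window` and the five-minute offsets standing in for t2 and t3 are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def records_in_window(records, t1, t2, t3):
    """Return the records whose timestamp lies in [t1 - t2, t1 + t3]."""
    start, end = t1 - t2, t1 + t3
    return [r for r in records if start <= r["time"] <= end]

# t1: time at which the fake-licensed vehicle was captured (hypothetical value).
t1 = datetime(2019, 11, 1, 8, 0, 0)

# Hypothetical passing records of the real plate taken from the history store.
history = [
    {"plate": "AXXXXX", "time": datetime(2019, 11, 1, 7, 58, 0), "device": "D01"},
    {"plate": "AXXXXX", "time": datetime(2019, 11, 1, 8, 1, 30), "device": "D02"},
    {"plate": "AXXXXX", "time": datetime(2019, 11, 1, 9, 0, 0), "device": "D03"},
]

# Keep only the records within five minutes before or after t1.
selected = records_in_window(history, t1, timedelta(minutes=5), timedelta(minutes=5))
```

Each selected record can then be paired with the fake-licensed record to form one five-dimensional vector for that license plate number.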
It should be noted that the positions of the snapshot devices are fixed. For example, each snapshot device is assigned a number, and each numbered device corresponds to a longitude and latitude. The monitoring range of a device is therefore a fixed adjacent area, and the longitude and latitude of a vehicle appearing in this area can be approximated by the longitude and latitude of the device, so the appearance time and appearance place of each fake-licensed vehicle can be obtained from the snapshot device that captured it. For a fake-licensed vehicle, if historical data of the real vehicle is obtained within the specified time range, the appearance time and appearance place of the real vehicle can be obtained in the same way, so that a five-dimensional vector consisting of the license plate number, the fake-licensed vehicle appearance time, the fake-licensed vehicle appearance place, the real vehicle appearance time and the real vehicle appearance place can be formed for that license plate number.
Using the same method of acquiring the time and place of a vehicle, a historical real vehicle appears at different places within a time range, so the places of the real vehicle at two different times can be acquired. The license plate number of the real vehicle, its two appearance times and the two corresponding appearance places are thus acquired to form a five-dimensional vector consisting of these five data.
Obtaining three-dimensional vector data corresponding to each license plate number based on the first five-dimensional vector data and the second five-dimensional vector data, wherein the three-dimensional vector data comprises: license plate number, time difference of vehicle appearance, distance of vehicle appearance.
The specific steps are as follows: obtaining, based on the first five-dimensional vector data, first three-dimensional data for the license plate number, wherein the first three-dimensional data comprises: license plate number, time difference between the real vehicle and the fake-licensed vehicle, and distance between the real vehicle and the fake-licensed vehicle; obtaining, based on the second five-dimensional vector data, second three-dimensional data for the license plate number, wherein the second three-dimensional data comprises: license plate number, time difference of the real vehicle, and distance of the real vehicle; and combining the first three-dimensional data and the second three-dimensional data into the three-dimensional vector data.
It should be noted that, for any five-dimensional vector in the first or second five-dimensional vector data, subtracting the two appearance times gives the time difference between the two appearances; correspondingly, calculating the distance between the two appearance places gives the distance corresponding to the two appearance times of the vehicle. In this way, the distance between the two places where the vehicle appears within one time difference is obtained. It can be understood that the distance a vehicle can travel is its travel speed (which lies within a range) multiplied by the time. If the time difference is short but the two appearance places are far apart, so that the achievable travel distance differs greatly from the distance between the two places, the records must come from two vehicles (one of which is a fake-licensed vehicle). These regularities and characteristic parameters of vehicle appearances are therefore used as training data, i.e. as training samples for decision tree learning. Similarly, for a real vehicle, the relationship between the distance between its two appearance places and the time difference is a time-displacement relationship achievable at normal travel speeds. By learning from a large number of samples, the decision tree can thus learn the travel characteristics of real vehicles and of fake-licensed vehicles.
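The intuition in the paragraph above, that a short time difference combined with a long distance implies an unreachable speed, can be made concrete with a small sketch. The patent learns its thresholds through the decision tree rather than using a fixed speed cap, so the 120 km/h limit and both function names below are purely illustrative assumptions.

```python
def implied_speed_kmh(delta_t_s, delta_d_m):
    """Speed (km/h) a single vehicle would need to cover delta_d_m metres in delta_t_s seconds."""
    if delta_t_s <= 0:
        # Two captures at the same instant: any non-zero distance is unreachable.
        return float("inf") if delta_d_m > 0 else 0.0
    return (delta_d_m / delta_t_s) * 3.6

def looks_like_two_vehicles(delta_t_s, delta_d_m, max_speed_kmh=120.0):
    """True when the implied speed exceeds the assumed upper bound on travel speed."""
    return implied_speed_kmh(delta_t_s, delta_d_m) > max_speed_kmh

# 8 km apart within 30 s: far beyond any plausible travel speed.
suspicious = looks_like_two_vehicles(30, 8000)
# 5 km apart within 10 minutes: an ordinary urban trip.
ordinary = looks_like_two_vehicles(600, 5000)
```

A fixed cap like this cannot distinguish a vehicle turning around near two closely spaced devices from a cloned plate, which is one reason the method trains a tree on labeled (Δt, Δd) samples instead.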
Illustratively, the three-dimensional vector data constitutes a training data set H, expressed for any one sample as:
Figure GDA0004016254630000071
Δt = |t_i - t_j|, i, j ∈ [1, n]
where p represents the license plate number, m1 and m2 respectively represent the unique identifiers of the two original records, t_i and t_j represent the times at which the vehicle appears, Δt represents the time difference between the two appearances (in seconds), and Δd represents the spatial distance between the two appearances (in meters).
Figure GDA0004016254630000072
where EARTH_RADIUS represents the radius of the Earth, 6371 km, and lat and lng are the latitude and longitude of the snapshot devices (or the latitude and longitude corresponding to the two vehicle appearances).
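The distance formula image is not reproduced in this text. A common way to compute the distance between two latitude/longitude points with an Earth radius of 6371 km is the haversine formula, so the sketch below assumes that form; the patent's exact expression may differ. The function names and the sample layout are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # 6371 km, as stated in the text, expressed in metres

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding pushing the square root slightly above 1.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def to_sample(plate, t_i, lat_i, lng_i, t_j, lat_j, lng_j):
    """Build one (plate, delta_t in seconds, delta_d in metres) training sample."""
    return (plate, abs(t_i - t_j), haversine_m(lat_i, lng_i, lat_j, lng_j))

# One degree of longitude along the equator is roughly 111.19 km.
one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)
```

Because each snapshot device is fixed, the distance between any two devices could also be precomputed once and looked up by device number instead of being recalculated per record.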
And taking the three-dimensional vector data as a sample of a training data set and a sample of a testing data set.
And S12, constructing a decision tree model.
And according to each license plate number in the three-dimensional data of the training data set, respectively taking the time difference of the real vehicle and the fake-licensed vehicle and the distance of the real vehicle and the fake-licensed vehicle as characteristics, and calculating the corresponding information gain.
According to the analysis and principle of fake-licensed vehicles, the time difference and the distance are extracted from the three-dimensional vector data in the training data set H as target features, and a decision tree is constructed using the C4.5 decision tree algorithm:
Probability distribution of the historical data in the training set:
P(X = x_i) = p_i, i = 1, 2, …, n
where P represents the probability that the sample X takes the value x_i in the set, and p_i is the probability of the i-th sample value in the set.
First, the entropy H(X) of the random variable is calculated:
Figure GDA0004016254630000081
X is a feature (time difference or distance difference), and x_i is the i-th sample, where i may index any sample.
Calculating the conditional entropy of the divided features A:
Figure GDA0004016254630000082
where n is the number of values of feature A;
Figure GDA0004016254630000083
where D represents the sample set of the respective feature X, D_i represents one of the K partitions within feature X_i, i.e. D_i represents the sample set of feature X_i, and D_ik represents the set of samples in feature X_i that belong to partition k;
the information gain of the feature a is obtained as:
g(X,A)=H(X)-H(X|A)
the information gain ratio of feature a may also be calculated as:
Figure GDA0004016254630000091
Based on this calculation, the feature with the largest information gain can be selected as the root node, and the other features are taken as leaf nodes.
The calculation process is repeated to obtain the root nodes and leaf nodes corresponding to all the features, which are added to the decision tree model.
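The feature-selection step can be illustrated with a short sketch that computes entropy, information gain and gain ratio for already discretized features. It assumes the standard ID3/C4.5 definitions with base-2 logarithms; the bucket labels, the toy samples and the function names are made up for the example.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def information_gain(feature_values, labels):
    """g(X, A) = H(X) - H(X|A) for one discretized feature A."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def gain_ratio(feature_values, labels):
    """Information gain divided by the entropy of the feature's own partition."""
    split_info = entropy(feature_values)
    return information_gain(feature_values, labels) / split_info if split_info > 0 else 0.0

# Toy samples: the time-difference bucket separates the classes, the distance bucket does not.
labels = ["fake", "fake", "real", "real"]
time_bucket = ["0-30", "0-30", "30-60", "30-60"]
dist_bucket = ["0-2k", "8k+", "0-2k", "8k+"]

gains = {
    "time_difference": information_gain(time_bucket, labels),
    "distance": information_gain(dist_bucket, labels),
}
best_feature = max(gains, key=gains.get)
```

In this toy data the time-difference feature has gain 1 and the distance feature has gain 0, so the time difference would be chosen for the first split.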
The decision tree model is then pruned: the model is verified using a reserved historical fake-licensed vehicle training data set, pruned according to the verification result, and its partitions are redefined.
Figure GDA0004016254630000092
A_p and A_q respectively represent the p partition and the q partition of feature A, S represents the inspection data set, and Model represents the decision tree model; if the ratio of the accuracy of the model after pruning, Model(A_p, S), to the accuracy of the model before pruning, Model(A_q, S), is greater than or equal to 1, the partition after pruning is effective.
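The acceptance rule for pruning can be sketched as a comparison of accuracies on the inspection set. The two predictor functions below are hypothetical stand-ins for the tree before and after pruning, and the samples are invented; only the ratio test itself comes from the text.

```python
def accuracy(predict, samples):
    """Fraction of (features, label) samples that predict() classifies correctly."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)

def keep_pruned(pruned_predict, unpruned_predict, check_set):
    """Accept the pruned partition when accuracy(pruned) / accuracy(unpruned) >= 1.

    Written as a comparison so that a zero accuracy before pruning does not
    cause a division by zero; the two forms agree whenever the ratio is defined.
    """
    return accuracy(pruned_predict, check_set) >= accuracy(unpruned_predict, check_set)

# Hypothetical inspection samples: ((delta_t seconds, delta_d metres), label).
check_set = [((20, 9000), "fake"), ((20, 500), "real"), ((80, 1000), "real")]

def unpruned(features):
    delta_t, delta_d = features
    return "fake" if delta_t < 30 and delta_d > 8000 else "real"

def pruned(features):
    # The time split has been removed; only the distance split remains.
    return "fake" if features[1] > 8000 else "real"

def over_pruned(features):
    # Everything collapsed into a single leaf.
    return "real"

accept_pruned = keep_pruned(pruned, unpruned, check_set)
accept_over_pruned = keep_pruned(over_pruned, unpruned, check_set)
```

Here the first pruning keeps the accuracy unchanged and is accepted, while collapsing the tree to a single leaf lowers the accuracy and is rejected.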
It should be noted that the C4.5 algorithm is a classical algorithm for generating decision trees, and is an extension and optimization of the ID3 algorithm. The result of training with the C4.5 algorithm is a classification model that can be understood as a decision tree, in which each split attribute is an internal node of the tree and each classification result is a leaf node. Each internal node has a left subtree and a right subtree, whereas a leaf node has neither.
In addition, the invention also discloses a decision tree model using method for real-time fake-licensed car analysis, which comprises the following steps:
selecting the maximum division value of the time difference feature in the decision tree model as the length of the time window for acquiring real-time streaming data, acquiring the real-time streaming data by consuming Kafka with Spark Streaming, and dividing the real-time streaming data into RDD data sets whose length is the maximum division value;
aggregating each RDD data set by license plate number, filtering out data with the same place, calculating the time difference between the passing records of the same license plate, and importing the time differences and the snapshot device point location information of each record into the decision tree model as source data to obtain the analysis result of the decision tree model, wherein the analysis result comprises the license plate information meeting the fake-licensed vehicle conditions and the corresponding passing records.
As shown in FIG. 2, after the training of the decision tree model is completed, the actual sample analysis process first determines whether the license plate numbers are the same; if so, time interval analysis is performed, and if not, the process ends. The time interval may be divided into a plurality of time segments; for example, taking 90 s as the division, into 0-30, 30-60, 60-90 and 90-+∞. Within the 0-30 time segment, the distance of the vehicle is then judged; taking 8 km as the division, the distance is further divided into 0-2k, 2k-4k, 4k-8k and 8k-+∞. The time difference and distance corresponding to any sample can be divided in this way, and a definite judgment result is finally obtained. For example, for license plate number AXXXXX, records with the same license plate number, a time interval of 55 s and a distance of 1.5 km fall in the 30-60 time segment and the 0-2k distance segment, and the corresponding analysis result is yes, i.e. a fake-licensed vehicle.
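Mapping a sample onto the example segments above is a simple interval lookup, sketched here with the standard-library `bisect` module. The boundaries are the illustrative ones from the text; treating each lower bound as inclusive is an assumption, and the yes/no decision attached to each leaf comes from training and is not reproduced here.

```python
from bisect import bisect_right

TIME_EDGES_S = [30, 60, 90]          # seconds
DIST_EDGES_M = [2000, 4000, 8000]    # metres
TIME_LABELS = ["0-30", "30-60", "60-90", "90-+inf"]
DIST_LABELS = ["0-2k", "2k-4k", "4k-8k", "8k-+inf"]

def bucket(value, edges, labels):
    """Return the label of the half-open interval [lower, upper) containing value."""
    return labels[bisect_right(edges, value)]

def tree_path(delta_t_s, delta_d_m):
    """Pair of (time segment, distance segment) a sample would follow in the tree."""
    return (bucket(delta_t_s, TIME_EDGES_S, TIME_LABELS),
            bucket(delta_d_m, DIST_EDGES_M, DIST_LABELS))

# A 55 s gap and a 1.5 km distance, as in the license plate example.
path = tree_path(55, 1500)
```

With these boundaries a 55 s gap lands in the 30-60 segment and a 1.5 km distance in the 0-2k segment.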
As shown in FIG. 3, in the model after pre-pruning, a leaf node of yes represents a fake-licensed vehicle and a leaf node of no represents a non-fake-licensed vehicle. The pruned model has fewer feature division nodes, which improves efficiency.
In a specific implementation, real-time vehicle passing data is analyzed according to the model through Spark Streaming. The maximum division value of the time difference feature in the decision tree model, T seconds, is selected as the length of the time window for acquiring real-time streaming data; the real-time streaming data is acquired by consuming Kafka with Spark Streaming and is divided into RDD data sets with a length of T seconds. Each RDD data set is aggregated by license plate number (in time order), data with the same place is filtered out, and the time difference between the passing records of the same license plate is calculated. The time differences and the snapshot device point location information of each record are imported into the model as source data for analysis. After analysis, the license plate information meeting the fake-licensed vehicle conditions and the corresponding passing records are pushed out in batches through Kafka and are also persisted to a database. Because training data containing vehicle turn-around situations is used as non-fake-licensed vehicle data when the decision tree model is constructed, the cases in which a vehicle is falsely reported as fake-licensed because it turned around are reduced, and the accuracy of the analysis is improved. The platform consumes the fake-licensed vehicle alarm data from Kafka and can display it to the user for screening.
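The per-window processing described above (aggregate by license plate, drop same-place pairs, compute time differences) can be sketched in plain Python without a Spark cluster. This is not Spark Streaming code and does not use its API; the record fields and the choice to compare every pair of records for a plate, rather than only consecutive ones, are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(window_records):
    """Group one window's records by plate and emit (plate, delta_t, device_a, device_b).

    Pairs captured by the same device are skipped, mirroring the step that
    filters out data with the same place.
    """
    by_plate = defaultdict(list)
    for rec in window_records:
        by_plate[rec["plate"]].append(rec)

    pairs = []
    for plate, recs in by_plate.items():
        recs.sort(key=lambda r: r["time"])        # time order within each plate
        for a, b in combinations(recs, 2):
            if a["device"] == b["device"]:
                continue                           # same place: not a candidate
            pairs.append((plate, b["time"] - a["time"], a["device"], b["device"]))
    return pairs

# Hypothetical records from one T-second window (time in seconds).
window = [
    {"plate": "AXXXXX", "time": 100, "device": "D01"},
    {"plate": "AXXXXX", "time": 155, "device": "D02"},
    {"plate": "BYYYYY", "time": 120, "device": "D01"},
    {"plate": "AXXXXX", "time": 160, "device": "D02"},
]
pairs = candidate_pairs(window)
```

Each emitted pair would then be enriched with the distance between the two devices and passed to the trained decision tree for classification.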
It should be noted that Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of a consumer-scale website. Such actions (web browsing, searching and other user actions) are a key factor in many social functions on the modern web. Because of throughput requirements, these data are typically handled through log processing and log aggregation. For systems such as Hadoop that work on log data and offline analysis but also require real-time processing, Kafka is a viable solution. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages through a cluster.
Spark Streaming is a stream processing engine based on Spark. Its basic principle is to split the data input in real time into time slices (on the order of seconds) and then to process the data of each time slice in a batch-like manner through the Spark engine.
Elasticsearch is a Lucene-based search server. It provides a distributed, multi-user-capable full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the Apache license terms, and is a popular enterprise-level search engine.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A method for building a decision tree model for real-time fake-licensed vehicle analysis, the method comprising:
S11, preparing a training data set and a verification data set;
obtaining fake-licensed car data appearing in a historical database; obtaining, according to the appearance time of each fake-licensed car, non-fake-licensed car data in a time range corresponding to that appearance time; and obtaining, based on the fake-licensed car data and the non-fake-licensed car data, first five-dimensional vector data relating each fake-licensed car to its corresponding real car, wherein the first five-dimensional vector data corresponding to any license plate number comprises: license plate number, fake-licensed vehicle appearance time, fake-licensed vehicle appearance place, real vehicle appearance time and real vehicle appearance place; and acquiring real vehicle data appearing in the historical database, and obtaining, according to the appearance time and place of each real vehicle, second five-dimensional vector data composed of a plurality of real vehicle data, wherein the second five-dimensional vector data comprises: license plate number, first appearance time of the real vehicle, first appearance place of the real vehicle, second appearance time of the real vehicle and second appearance place of the real vehicle;
obtaining three-dimensional vector data corresponding to each license plate number based on the first five-dimensional vector data and the second five-dimensional vector data, wherein the three-dimensional vector data comprises: license plate number, time difference of vehicle appearance, distance of vehicle appearance;
taking the three-dimensional vector data as a sample of a training data set and a sample of a testing data set;
S12, constructing a decision tree model;
according to each license plate number in the three-dimensional data of the training data set, respectively taking the time difference of the real vehicle and the fake-licensed vehicle and the distance of the real vehicle and the fake-licensed vehicle as characteristics, and calculating the corresponding information gain;
according to the information gain corresponding to each license plate number, a root node and a leaf node are constructed to form a preliminary decision tree model;
and verifying and pruning the preliminary decision tree model according to the training data set to obtain the decision tree model.
2. The method of claim 1, wherein the step of obtaining three-dimensional vector data corresponding to each license plate number based on the first and second five-dimensional vector data comprises:
obtaining first three-dimensional data aiming at the license plate number based on the first five-dimensional vector data, wherein the first three-dimensional data comprises: license plate number, time difference between a real vehicle and a fake-licensed vehicle, and distance between the real vehicle and the fake-licensed vehicle; and the second five-dimensional vector data is used for obtaining second three-dimensional data aiming at the license plate number, wherein the second three-dimensional data comprises: license plate number, time difference of real vehicle, distance of real vehicle;
combining the first three-dimensional data and the second three-dimensional data into three-dimensional vector data.
3. The method for establishing a decision tree model for real-time fake-licensed vehicle analysis according to claim 1 or 2, wherein the formula for calculating the information gain g(X, A) is expressed as:
g(X,A)=H(X)-H(X|A)
wherein,
Figure FDA0004016254620000021
H(X) is the entropy of the random variable, H(X|A) is the conditional entropy given feature A, n is the number of values of feature A, and p_i is the probability of the i-th sample in the set; D represents the sample set of the respective feature X, D_i represents one of the K divisions of feature X_i, i.e. D_i represents the sample set of feature X_i, and D_ik represents the sample set of division k in feature X_i.
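As an illustration of g(X,A)=H(X)-H(X|A), the following Python sketch computes the information gain of a discretized feature. Because the formula image is not reproduced in this text, the sketch assumes the standard definitions (Shannon entropy, and conditional entropy as the size-weighted entropy of each division) and a base-2 logarithm; the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(X) of a list of class labels (base 2, assumed)."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """g(X, A) = H(X) - H(X|A) for one discretized feature A.

    feature_values -- the division (e.g. time segment) of each sample
    labels         -- the class of each sample (e.g. 'yes' / 'no')
    """
    total = len(labels)
    # Group the labels by the value of feature A.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: entropy of each group weighted by its share.
    conditional = sum(
        len(group) / total * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional
```

A feature that separates the classes perfectly yields a gain equal to H(X), while a feature that is independent of the class yields a gain of 0.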
4. The method as claimed in claim 3, wherein the step of constructing a preliminary decision tree model by constructing a root node and a leaf node according to the information gain corresponding to each license plate number comprises:
selecting the characteristic with the maximum information gain as a root node and the other characteristics as leaf nodes according to the information gain corresponding to each license plate number;
acquiring a root node and a leaf node corresponding to each feature;
and constructing a preliminary decision tree model based on the acquired root nodes and leaf nodes.
5. The method for building a decision tree model for real-time fake-licensed vehicle analysis according to any one of claims 1-2 and 4, wherein the step of verifying and pruning the preliminary decision tree model according to the training data set to obtain the decision tree model comprises:
verifying the preliminary decision tree model through a training data set;
and pruning according to the verification result and a preset formula to obtain a decision tree model.
6. The method for building a decision tree model for real-time fake-licensed vehicle analysis of claim 5, wherein the predetermined formula is specifically expressed as:
Figure FDA0004016254620000031
wherein A_p and A_q respectively represent the p division and the q division of feature A, S represents the test data set, and Model represents the decision tree model; if the ratio of the accuracy of the model after pruning, Model(A_p, S), to the accuracy of the model before pruning, Model(A_q, S), is greater than or equal to 1, the division after pruning is effective.
7. The method for building a decision tree model for real-time fake-licensed vehicle analysis according to claim 1, wherein the three-dimensional vector data is specifically expressed as:
Figure FDA0004016254620000032
where Δt = |t_i - t_j|, i, j ∈ [1, n],
wherein p represents the license plate number, m1 and m2 respectively represent the unique identifiers of two sample data, t_i is the appearance time of license plate number p at m1, t_j is the appearance time of license plate number p at m2, Δt represents the time difference between the two sample data, and Δd represents the distance corresponding to the two appearances of license plate number p,
Figure FDA0004016254620000033
where EARTH_RADIUS represents the radius of the Earth, lat_i is the latitude of the snapshot device corresponding to time t_i, and lng_j is the longitude of the snapshot device corresponding to time t_j.
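The distance formula itself is only available as an image, so the following Python sketch is an assumption: it uses the common haversine great-circle formula, which takes the same inputs (Earth radius, and the latitude and longitude of the two snapshot devices). The constant `EARTH_RADIUS_KM` and the function name `haversine_km` are hypothetical, and the mean radius of 6371 km is an assumed value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in kilometres


def haversine_km(lat_i, lng_i, lat_j, lng_j):
    """Great-circle distance in km between two points given in degrees.

    This is the standard haversine formula; it is assumed, not
    confirmed, to match the formula in the omitted figure.
    """
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    d_phi = phi_j - phi_i
    d_lambda = math.radians(lng_j - lng_i)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi_i) * math.cos(phi_j) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

The result would serve as Δd for a pair of passing records, with Δt computed from the two timestamps.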
8. A method of using a decision tree model for real-time fake-licensed vehicle analysis, the method comprising:
selecting the maximum division value of the time difference feature in the decision tree model as the length of a time window for acquiring real-time streaming data, acquiring the real-time streaming data by using Spark Streaming to consume Kafka, and dividing the real-time streaming data into RDD data sets whose length is the maximum division value;
and aggregating each RDD data set through the license plate number, filtering out data with consistent places, respectively calculating the time difference of each passing record of the same license plate, and importing the time difference and the snapshot equipment point location information of each record into a decision tree model as source data to obtain the analysis result of the decision tree model, wherein the analysis result comprises the license plate information meeting the conditions of the fake-licensed vehicle and the passing record thereof.
CN201911196978.9A 2019-11-29 2019-11-29 Decision tree model establishing method and using method for real-time fake-licensed car analysis Active CN111079940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911196978.9A CN111079940B (en) 2019-11-29 2019-11-29 Decision tree model establishing method and using method for real-time fake-licensed car analysis

Publications (2)

Publication Number Publication Date
CN111079940A CN111079940A (en) 2020-04-28
CN111079940B true CN111079940B (en) 2023-03-31

Family

ID=70311955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911196978.9A Active CN111079940B (en) 2019-11-29 2019-11-29 Decision tree model establishing method and using method for real-time fake-licensed car analysis

Country Status (1)

Country Link
CN (1) CN111079940B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257869A (en) * 2020-09-29 2021-01-22 北京北大千方科技有限公司 Fake-licensed car analysis method and system based on random forest and computer medium
CN113806594A (en) * 2020-12-30 2021-12-17 京东科技控股股份有限公司 Business data processing method, device, equipment and storage medium based on decision tree

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679191A (en) * 2013-09-04 2014-03-26 西交利物浦大学 An automatic fake-licensed vehicle detection method based on static state pictures
CN104200669A (en) * 2014-08-18 2014-12-10 华南理工大学 Fake-licensed car recognition method and system based on Hadoop
CN106096748A (en) * 2016-04-28 2016-11-09 武汉宝钢华中贸易有限公司 Entrucking forecast model in man-hour based on cluster analysis and decision Tree algorithms
CN106599905A (en) * 2016-11-25 2017-04-26 杭州中奥科技有限公司 Fake-licensed vehicle analysis method based on deep learning
CN107067736A (en) * 2017-04-12 2017-08-18 安徽超远信息技术有限公司 Fake-licensed car analysis method and its system based on time road network
CN107977421A (en) * 2017-11-24 2018-05-01 泰华智慧产业集团股份有限公司 The method and device of fake-licensed car analysis is carried out based on big data
CN110135318A (en) * 2019-05-08 2019-08-16 佳都新太科技股份有限公司 Cross determination method, apparatus, equipment and the storage medium of vehicle record
CN110164137A (en) * 2019-05-17 2019-08-23 湖南科创信息技术股份有限公司 Based on bayonet to the recognition methods of the fake license plate vehicle of running time and system, medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255514B2 (en) * 2017-08-21 2019-04-09 Sap Se Automatic identification of cloned vehicle identifiers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
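Design of a checkpoint-based automatic detection and identity recognition *** for illegal vehicles; Bai Liangliang (白亮亮) et al.; 《信息化研究》; 20160820 (No. 04); full text *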

Also Published As

Publication number Publication date
CN111079940A (en) 2020-04-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant