Summary of the Invention
Embodiments of the present invention provide a method and an apparatus for processing data features, which can reduce the cost of data extraction and improve its accuracy.
To achieve the above objective, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, an embodiment of the present invention provides a method for processing data features, including:
Obtaining a plaintext sample from a service log, where the plaintext sample includes at least a special field and a feature field, and the special field includes fields used to represent an execution command and an operation command;
Obtaining a plaintext feature from the feature field according to a preconfigured feature class, and recording a sample signature, where special fields with identical content correspond to the same sample signature;
Extracting one special field corresponding to the sample signature, and splicing the obtained plaintext feature onto that special field to obtain a spliced field;
Outputting the spliced field as a feature sample.
With reference to the first aspect, in a first possible implementation of the first aspect, the obtaining a plaintext sample from a service log includes:
Reading the plaintext fields in the service log;
Removing fields of a first type from the plaintext fields; and/or converting the characters of fields of a second type in the plaintext fields into a specified format;
Storing, by means of a MapReduce framework, the fields resulting from the removal and/or conversion in memory in Map form.
With reference to the first aspect, in a second possible implementation of the first aspect, the obtaining a plaintext feature from the feature field according to a preconfigured feature class includes:
Reading the fields in the feature class in sequence, where each field in the feature class has the same content as at least one field in the plaintext sample;
According to the content of the fields in the feature class, reading the fields with the same content from the plaintext sample in sequence as the feature fields;
Recording the feature fields read in sequence from the plaintext sample in a feature set.
With reference to the second possible implementation of the first aspect, in a third possible implementation, the outputting the spliced field as a feature sample includes:
Importing, by means of the MapReduce framework, the feature sample and the feature set into the Reduce stage;
The recording the feature fields read in sequence from the plaintext sample in a feature set includes: outputting identical feature fields read from the plaintext sample to the same computing node.
With reference to the first aspect, in a fourth possible implementation of the first aspect, the method further includes:
Reading a basic feature class, and updating the basic feature class by means of a reflection mechanism;
Using the most recently updated basic feature class as the preconfigured feature class.
In a second aspect, an embodiment of the present invention provides an apparatus for processing data features, including:
An extraction unit, configured to obtain a plaintext sample from a service log, where the plaintext sample includes at least a special field and a feature field, and the special field includes fields used to represent an execution command and an operation command;
A recognition unit, configured to obtain a plaintext feature from the feature field according to a preconfigured feature class, and to record a sample signature, where special fields with identical content correspond to the same sample signature;
A splicing unit, configured to extract one special field corresponding to the sample signature, and to splice the obtained plaintext feature onto that special field to obtain a spliced field;
An output unit, configured to output the spliced field as a feature sample.
With reference to the second aspect, in a first possible implementation of the second aspect, the apparatus further includes a preprocessing unit, configured to: read the plaintext fields in the service log; remove fields of a first type from the plaintext fields, and/or convert the characters of fields of a second type in the plaintext fields into a specified format; and store, by means of a MapReduce framework, the fields resulting from the removal and/or conversion in memory in Map form.
With reference to the second aspect, in a second possible implementation of the second aspect, the recognition unit is specifically configured to: read the fields in the feature class in sequence, where each field in the feature class has the same content as at least one field in the plaintext sample; according to the content of the fields in the feature class, read the fields with the same content from the plaintext sample in sequence as the feature fields; and record the feature fields read in sequence from the plaintext sample in a feature set.
With reference to the second possible implementation of the second aspect, in a third possible implementation, the output unit is specifically configured to: import, by means of the MapReduce framework, the feature sample and the feature set into the Reduce stage; and output identical feature fields read from the plaintext sample to the same computing node.
With reference to the second aspect, in a fourth possible implementation of the second aspect, the apparatus further includes a feature class management unit, configured to: read a basic feature class and update the basic feature class by means of a reflection mechanism; and use the most recently updated basic feature class as the preconfigured feature class.
According to the method and apparatus for processing data features provided in the embodiments of the present invention, a plaintext feature is obtained from the feature field of a plaintext sample according to a preconfigured feature class, and a sample signature is recorded; one special field corresponding to the sample signature is extracted, and the plaintext feature is spliced onto that special field; the spliced field is then output as a feature sample for use in data extraction. Compared with the prior art, the embodiments extract the required features from massive data, which addresses the prior-art difficulty of extracting large-scale, multi-dimensional data and alleviates the need to update models frequently, thereby reducing the cost of data extraction and improving its accuracy.
Detailed Description of the Embodiments
To enable those skilled in the art to better understand the technical solutions of the present invention, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary, are intended only to explain the present invention, and shall not be construed as limiting the claims.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the word "including" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes any one of, and all combinations of, one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as herein, will not be interpreted in an idealized or overly formal sense.
This embodiment may use a distributed processing architecture based on MapReduce (also referred to as a MapReduce framework); the specific architecture of the MapReduce framework used in this embodiment may be as shown in FIG. 1. During execution, the data to be processed is stored in memory in map form. Specifically, a Hadoop-based MapReduce framework may be used. For feature extraction, the feature fields and special fields of the data are extracted and output in the map stage, and identical feature fields are accumulated in the reduce stage. For samples, sampling is performed in the map stage, and the feature samples with recorded sample signatures are output in the reduce stage.
An embodiment of the present invention provides a method for processing data features. As shown in FIG. 2, the method includes:
S1, obtaining a plaintext sample from a service log.
The plaintext sample includes at least a special field and a feature field, and the special field includes fields used to represent an execution command and an operation command. The service log may be log data recorded while a service system is running, for example, log data recorded while an advertisement delivery system is running. The plaintext sample may be the unencrypted characters in the service log. Specifically, the obtained plaintext sample may be in a tab-separated text format and include special fields used to represent "display" and "click", such as "show" and "clk".
Steps S1 to S4 may specifically be performed by a server in the map stage of the MapReduce framework.
S2, obtaining a plaintext feature from the feature field according to a preconfigured feature class, and recording a sample signature.
In this embodiment, the server in the map stage reads the preconfigured feature class. The feature class includes fields configured in order, and each field in the feature class has the same content as at least one field in the plaintext sample. The server in the map stage reads the input plaintext sample in key-value form and, according to the preconfigured feature class, stores it in memory in map form. The memory described in this embodiment may specifically be the memory of the user's local device or the memory of the server in the map stage.
The server in the map stage may first strip from the plaintext sample the special fields used to represent "display" and "click", and then extract the feature fields from the plaintext sample in sequence according to the field content described in the preconfigured feature class. The sample signature corresponds to the plaintext sample. Because the special fields representing "display" and "click" often repeat within a plaintext sample, special fields with identical content in the same plaintext sample correspond to the same sample signature. The sample signature may be assigned by the server when the plaintext sample is stored in memory in map form, or may be preconfigured in the plaintext sample.
S3, extracting one special field corresponding to the sample signature, and splicing the obtained plaintext feature onto that special field to obtain a spliced field.
For example, for the plaintext sample: "show clk A ..., show clk B ..., show clk C ..., show clk D",
the special field is "show clk" and the feature fields are "A B C D", so the following features can be obtained: A show clk, B show clk, C show clk, D show clk. Splicing then yields the spliced field: "show clk feaA feaB feaC feaD".
S4, outputting the spliced field as a feature sample.
The feature sample may be output by the server in the map stage to a server in the reduce stage.
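The splicing in S3 can be illustrated with a minimal sketch. It assumes that the plaintext sample is tab-separated, that the special fields are exactly "show" and "clk", and that a feature is named by prefixing "fea" to the feature field, as in the example above. The class name SpliceSketch and the method splice are hypothetical and are not part of the described embodiment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SpliceSketch {
    // Assumed special fields, taken from the "show"/"clk" example.
    static final Set<String> SPECIAL = new HashSet<>(Arrays.asList("show", "clk"));

    /** Keeps one copy of each special field and appends "fea" + each feature field. */
    static String splice(String plaintextSample) {
        Set<String> specialSeen = new LinkedHashSet<>(); // repeated special fields collapse here
        List<String> features = new ArrayList<>();
        for (String token : plaintextSample.split("\t")) {
            if (SPECIAL.contains(token)) {
                specialSeen.add(token);
            } else if (!token.isEmpty()) {
                features.add("fea" + token);
            }
        }
        List<String> spliced = new ArrayList<>(specialSeen);
        spliced.addAll(features);
        return String.join("\t", spliced);
    }

    public static void main(String[] args) {
        System.out.println(splice("show\tclk\tA\tshow\tclk\tB\tshow\tclk\tC\tshow\tclk\tD"));
    }
}
```

Because the repeated "show clk" pairs have identical content, the sketch keeps a single copy, which mirrors the idea that special fields with identical content share one sample signature.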
In this embodiment, to extract features, the plaintext feature needs to be obtained from the feature field in the map stage according to the preconfigured feature class. The preconfigured feature class can be obtained through the reflection mechanism in Java. As a result, for general requirements, a user extracting features does not need to develop a feature extraction program based on data tables as in the prior art; for specific requirements, the user only needs to use the feature extraction framework of this embodiment (that is, the MapReduce framework that runs the process of this embodiment) to extract the required features from massive data according to the preconfigured feature class.
The reflection mechanism used in this embodiment works as follows: which class needs to be loaded is not determined at compile time; the specific class is loaded only when the program runs, so that the structure and attributes of the class can be obtained. In other words, a class that is unknown at compile time can be used. For example, after a class is loaded, the Java Virtual Machine automatically generates a Class object, and through this Class object, information such as the declarations and definitions of the methods, members, and constructors of the class loaded into the virtual machine can be obtained. As a concrete example, the process of obtaining the preconfigured feature class through the reflection mechanism in Java may include:
Using the Java reflection mechanism, defining a factory class for feature classes (Feature).
The class names of the feature classes are then configured in the user's own service configuration for feature extraction; multiple slots and multiple features can be configured, and the classes need not be loaded in advance.
Afterwards, when the factory is invoked, the user configuration is parsed to obtain the feature class name for each slot number, and the feature parsing class is instantiated by reflection and provided to the feature extractor for extracting features. Any type of feature extraction service class can be added according to specific service requirements: the feature class name is configured in the configuration file, and during feature extraction the user-written feature classes are applied to different slots. Further, preprocessing classes are handled in the same way: a separate preprocessing factory class is defined so that the Java reflection mechanism can be used.
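The reflection-based factory described above might look roughly like the following sketch. It is not the code of the embodiment: the interface FeatureExtractor, the example class PrefixFeature, and the class FeatureFactorySketch are hypothetical names introduced only for illustration. The reflection calls themselves (Class.forName, asSubclass, getDeclaredConstructor, newInstance) are standard Java.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical contract that every configured feature class implements. */
interface FeatureExtractor {
    String extract(String fieldValue);
}

/** Hypothetical user-written feature class; real classes would be supplied per service. */
class PrefixFeature implements FeatureExtractor {
    public PrefixFeature() {}

    @Override
    public String extract(String fieldValue) {
        return "fea" + fieldValue;
    }
}

class FeatureFactorySketch {
    // slot number -> feature class name, as it might be parsed from a user configuration file
    private final Map<Integer, String> classNameBySlot = new LinkedHashMap<>();

    void configure(int slot, String className) {
        classNameBySlot.put(slot, className);
    }

    /** Loads the class configured for the slot at run time and instantiates it by reflection. */
    FeatureExtractor create(int slot) {
        String className = classNameBySlot.get(slot);
        if (className == null) {
            throw new IllegalArgumentException("no feature class configured for slot " + slot);
        }
        try {
            Class<?> loaded = Class.forName(className); // resolved only now, not at compile time
            return loaded.asSubclass(FeatureExtractor.class)
                         .getDeclaredConstructor()
                         .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        FeatureFactorySketch factory = new FeatureFactorySketch();
        factory.configure(1, PrefixFeature.class.getName());
        System.out.println(factory.create(1).extract("A"));
    }
}
```

With this design, adding a new feature class only requires writing the class and adding its name to the configuration; the factory itself does not need to be recompiled.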
According to the method for processing data features provided in this embodiment of the present invention, a plaintext feature is obtained from the feature field of a plaintext sample according to a preconfigured feature class, and a sample signature is recorded; one special field corresponding to the sample signature is extracted, and the plaintext feature is spliced onto that special field; the spliced field is then output as a feature sample for use in data extraction. Compared with the prior art, this embodiment extracts the required features from massive data, which addresses the prior-art difficulty of extracting large-scale, multi-dimensional data and alleviates the need to update models frequently, thereby reducing the cost of data extraction and improving its accuracy.
In this embodiment, the fields in a plaintext sample may be preprocessed, for example on the server in the map stage, either after the plaintext sample has been stored in memory in map form or before it is stored in memory. For example, for characters in encodings such as URL-ENCODE and base64, preprocessing such as half-width/full-width conversion and English case conversion may be performed; user-defined preprocessing may also be included. Accordingly, the obtaining a plaintext sample from a service log includes:
Reading the plaintext fields in the service log. Removing fields of a first type from the plaintext fields, and/or converting the characters of fields of a second type in the plaintext fields into a specified format. Storing, by means of the MapReduce framework, the fields resulting from the removal and/or conversion in memory in Map form.
Here, a field of the first type is a field that contains erroneous data or cannot be read, or a character used to represent specific content (for example, a character representing a modification date, or a delimiter). A field of the second type is a field whose characters can be converted, for example by half-width/full-width conversion or English case conversion. The format of the converted characters is the specified format, which is either preset by the user or stored in advance on the server in the map stage.
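The conversions named above can be sketched with standard Java library calls. The sketch assumes that URL-encoded and base64 characters are decoded as UTF-8 before further processing, that full-width characters are mapped to half-width, and that the specified format is lower case; these choices, and the class name PreprocessSketch, are assumptions for illustration only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;

class PreprocessSketch {
    /** Decodes a URL-encoded field, assuming UTF-8. */
    static String urlDecode(String field) {
        try {
            return URLDecoder.decode(field, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Decodes a base64 field, assuming the payload is UTF-8 text. */
    static String base64Decode(String field) {
        return new String(Base64.getDecoder().decode(field), StandardCharsets.UTF_8);
    }

    /** Maps full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space to half-width. */
    static String toHalfWidth(String field) {
        StringBuilder sb = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c == '\u3000') {
                sb.append(' ');
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Converts a second-type field into the assumed specified format: half-width, lower case. */
    static String normalize(String field) {
        return toHalfWidth(field).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize(urlDecode("Show%20CLK")));
    }
}
```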
In this embodiment, the obtaining a plaintext feature from the feature field according to a preconfigured feature class includes:
Reading the fields in the feature class in sequence. According to the content of the fields in the feature class, reading the fields with the same content from the plaintext sample in sequence as the feature fields. Recording the feature fields read in sequence from the plaintext sample in a feature set.
Each field in the feature class has the same content as at least one field in the plaintext sample. Specifically, when the server in the map stage obtains a new set of plaintext samples, it initializes the preconfigured feature classes that are to be used for extraction and, according to the features that the configuration requires, calls the feature classes one by one to perform feature extraction. For example:
The plaintext sample is: "show clk A B C D";
The preconfigured feature classes include:
Feaclass=featureclass1;Dpd=A;Slot=1,
Feaclass=featureclass2;Dpd=B;Slot=2,
Feaclass=featureclass3;Dpd=C;Slot=3,
Feaclass=featureclass4;Dpd=D;Slot=4,
The server may initialize featureclass1, featureclass2, featureclass3, and featureclass4, and then extract the features feaA, feaB, and so on through feaD in the configured order. The server thus obtains the feature set {feaA, feaB, feaC, feaD} together with the plaintext sample show clk A B C D. The server performs the splicing according to the relationship between the special field and the feature fields, where the relationships between fields may include {feaA show clk ...}. When splicing is complete, one feature sample is obtained: show clk feaA feaB feaC feaD.
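The configuration-driven extraction in this example can be sketched as follows. The sketch assumes that each configuration entry has the form shown above, that Dpd names the field whose content the feature class reads, and that Slot gives the extraction order; the source does not define these keys further, so this interpretation, the "fea" naming, and the class ConfiguredExtractionSketch are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConfiguredExtractionSketch {
    /** One parsed entry such as "Feaclass=featureclass1;Dpd=A;Slot=1". */
    static final class Entry {
        final String feaclass; // class name; kept for completeness, not instantiated here
        final String dpd;      // assumed: content of the field this feature depends on
        final int slot;        // assumed: position in the extraction order

        Entry(String feaclass, String dpd, int slot) {
            this.feaclass = feaclass;
            this.dpd = dpd;
            this.slot = slot;
        }
    }

    static Entry parse(String line) {
        Map<String, String> kv = new HashMap<>();
        String cleaned = line.trim().replaceAll(",$", ""); // tolerate a trailing comma
        for (String part : cleaned.split(";")) {
            String[] pair = part.split("=", 2);
            if (pair.length == 2) {
                kv.put(pair[0].trim(), pair[1].trim());
            }
        }
        return new Entry(kv.get("Feaclass"), kv.get("Dpd"), Integer.parseInt(kv.get("Slot")));
    }

    /** Returns the feature set in slot order for fields that actually occur in the sample. */
    static List<String> extract(List<String> configLines, String plaintextSample) {
        List<Entry> entries = new ArrayList<>();
        for (String line : configLines) {
            entries.add(parse(line));
        }
        entries.sort(Comparator.comparingInt((Entry e) -> e.slot));
        List<String> fields = Arrays.asList(plaintextSample.trim().split("\\s+"));
        List<String> featureSet = new ArrayList<>();
        for (Entry e : entries) {
            if (fields.contains(e.dpd)) {
                featureSet.add("fea" + e.dpd);
            }
        }
        return featureSet;
    }

    public static void main(String[] args) {
        List<String> config = Arrays.asList(
                "Feaclass=featureclass1;Dpd=A;Slot=1,",
                "Feaclass=featureclass2;Dpd=B;Slot=2,");
        System.out.println(extract(config, "show clk A B C D"));
    }
}
```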
In this embodiment, the outputting the spliced field as a feature sample includes:
Importing, by means of the MapReduce framework, the feature sample and the feature set into the Reduce stage. The recording the feature fields read in sequence from the plaintext sample in a feature set includes: outputting identical feature fields read from the plaintext sample to the same computing node.
For example, this embodiment may use the Hadoop MapReduce framework. The server in the map stage performs S1 to S4 and then outputs the execution results (which include the feature samples and the feature set) to the server in the reduce stage. Specifically, a feature sample is output directly to reduce without further processing; for a feature set, the bucketing principle of the MapReduce framework is used to assign identical features to the same computing node. When the server in the reduce stage receives a feature sample, it outputs the feature sample directly; when it receives a feature set, it accumulates the show and clk values corresponding to the feature set and then outputs the result.
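The bucketing and accumulation described above can be simulated in plain Java without a Hadoop cluster. In the sketch below, the bucket is chosen from the hash code of the feature key, which is how hash partitioning guarantees that identical keys reach the same node; the record layout {feature, show, clk} and the class name ReduceSketch are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class ReduceSketch {
    /** Chooses a reduce bucket so that equal feature keys always map to the same bucket. */
    static int bucket(String feature, int numReducers) {
        return (feature.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Sums the show and clk counts per feature.
     * Each record is assumed to be {feature, show, clk}; the result maps
     * feature -> {total show, total clk}.
     */
    static Map<String, long[]> accumulate(String[][] records) {
        Map<String, long[]> totals = new TreeMap<>();
        for (String[] record : records) {
            long[] sum = totals.computeIfAbsent(record[0], key -> new long[2]);
            sum[0] += Long.parseLong(record[1]);
            sum[1] += Long.parseLong(record[2]);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[][] records = {{"feaA", "1", "0"}, {"feaB", "1", "1"}, {"feaA", "1", "1"}};
        for (Map.Entry<String, long[]> e : accumulate(records).entrySet()) {
            System.out.println(e.getKey() + "\tbucket=" + bucket(e.getKey(), 4)
                    + "\tshow=" + e.getValue()[0] + "\tclk=" + e.getValue()[1]);
        }
    }
}
```

In an actual Hadoop job the framework performs the partitioning and grouping itself, so the reducer would only contain the summation step.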
In this embodiment, the method further includes:
Reading a basic feature class, and updating the basic feature class by means of the reflection mechanism.
Using the most recently updated basic feature class as the preconfigured feature class.
An embodiment of the present invention further provides an apparatus for processing data features. When applied in a MapReduce framework, the apparatus may specifically run on a server in the map stage. As shown in FIG. 3a, the apparatus includes:
An extraction unit, configured to obtain a plaintext sample from a service log, where the plaintext sample includes at least a special field and a feature field, and the special field includes fields used to represent an execution command and an operation command.
A recognition unit, configured to obtain a plaintext feature from the feature field according to a preconfigured feature class, and to record a sample signature, where special fields with identical content correspond to the same sample signature.
A splicing unit, configured to extract one special field corresponding to the sample signature, and to splice the obtained plaintext feature onto that special field to obtain a spliced field.
An output unit, configured to output the spliced field as a feature sample.
In this embodiment, the recognition unit is specifically configured to read the fields in the feature class in sequence, where each field in the feature class has the same content as at least one field in the plaintext sample; to read, according to the content of the fields in the feature class, the fields with the same content from the plaintext sample in sequence as the feature fields; and to record the feature fields read in sequence from the plaintext sample in a feature set.
In this embodiment, the output unit is specifically configured to import, by means of the MapReduce framework, the feature sample and the feature set into the Reduce stage, and to output identical feature fields read from the plaintext sample to the same computing node.
Further, as shown in FIG. 3b, the apparatus further includes a preprocessing unit, configured to read the plaintext fields in the service log; to remove fields of a first type from the plaintext fields, and/or convert the characters of fields of a second type in the plaintext fields into a specified format; and to store, by means of the MapReduce framework, the fields resulting from the removal and/or conversion in memory in Map form.
Further, as shown in FIG. 3c, the apparatus further includes a feature class management unit, configured to read a basic feature class and update the basic feature class by means of the reflection mechanism, and to use the most recently updated basic feature class as the preconfigured feature class.
According to the apparatus for processing data features provided in this embodiment of the present invention, a plaintext feature is obtained from the feature field of a plaintext sample according to a preconfigured feature class, and a sample signature is recorded; one special field corresponding to the sample signature is extracted, and the plaintext feature is spliced onto that special field; the spliced field is then output as a feature sample for use in data extraction. Compared with the prior art, this embodiment extracts the required features from massive data, which addresses the prior-art difficulty of extracting large-scale, multi-dimensional data and alleviates the need to update models frequently, thereby reducing the cost of data extraction and improving its accuracy.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment. A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like. The foregoing is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.