CN104504021A - Data matching method and device - Google Patents

Info

Publication number
CN104504021A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410766705.4A
Other languages
Chinese (zh)
Inventor
焦张波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410766705.4A
Publication of CN104504021A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/156 Query results presentation
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data matching method and device. The method comprises the following steps: a first data set and a second data set are obtained, wherein the first data set comprises at least a preset first multimedia file name and the second data set comprises at least a second multimedia file name played at a client; the first data set is cleaned according to a predetermined condition to obtain a first target data set, and the second data set is cleaned according to the same predetermined condition to obtain a second target data set, wherein the cleaning is used to filter feature data in the first data set and the second data set; and the first target data set is matched against the second target data set. The method and device solve the technical problem of low data matching accuracy caused by the data matching approach provided in the prior art.

Description

Data matching method and device
Technical field
The present invention relates to the field of computers, and in particular to a data matching method and device.
Background art
As network platforms continue to expand, the resources available over the network are becoming increasingly abundant, and more and more people choose to watch multimedia files online. To offer users network resources that better match their preferences, the program teams of some network platforms need to perform further statistical analysis of users' playback behavior.
At present, the approach commonly used in the prior art is for a program team to match a pre-established multimedia file list directly against the multimedia files that users have chosen to watch, so as to obtain the number of users who selected each multimedia file provided by that program team and thereby analyze users' playback behavior statistically. Specifically, the name of each multimedia file in the pre-established list is compared directly with the name of the multimedia file the user chose to play. If the two names are identical, the user is considered to have watched the corresponding multimedia file in the list; otherwise, the user is considered not to have watched it. However, because the multimedia file provider and the multimedia file viewer have different requirements, the name of the multimedia file that the user plays is usually not formatted consistently with the name in the multimedia file list.
As a result, the existing data matching approach loses part of the playback data, which leads to missed matches and inaccurate matching results, and in turn reduces the accuracy of the analysis of users' playback behavior.
No effective solution to this problem in the related art has yet been proposed.
Summary of the invention
The main purpose of the present invention is to provide a data matching method and device that solve the technical problem of low data matching accuracy caused by the data matching approach provided in the prior art.
According to one aspect of the present invention, a data matching method is provided. The method comprises: obtaining a first data set and a second data set, wherein the first data set comprises at least a preset first multimedia file name and the second data set comprises at least a second multimedia file name played at a client; performing data cleaning on the first data set according to a predetermined condition to obtain a first target data set, and performing data cleaning on the second data set according to the predetermined condition to obtain a second target data set, wherein the cleaning is used to filter feature data in the first data set and the second data set; and matching the first target data set against the second target data set.
Optionally, before the data cleaning is performed on the first data set and the second data set according to the predetermined condition, the method further comprises: establishing a feature database containing the feature data used for the data cleaning, wherein the feature data comprises at least feature strings and feature keywords.
Optionally, establishing the feature database comprises at least one of the following: detecting whether the first data set and the second data set contain feature strings and feature keywords, and adding the detected feature strings and feature keywords to the feature database; and obtaining a feature string set and/or a feature keyword set already stored in a database, and adding that set to the feature database.
Optionally, performing the data cleaning on the first data set and the second data set according to the predetermined condition comprises: searching, according to the feature database, whether the first multimedia file name in the first data set and the second multimedia file name in the second data set contain a feature string and/or a feature keyword; and, if a feature string and/or a feature keyword is found, deleting it.
Optionally, the feature data further comprises feature phrases, and performing the data cleaning on the first data set and the second data set according to the predetermined condition further comprises: performing regular-expression matching between the feature phrases and the first multimedia file name in the first data set, filtering out and deleting the feature phrases in the first multimedia file name to obtain a target name of the multimedia file, and updating the first multimedia file name in the first data set to the target name, so as to obtain the first target data set; and performing regular-expression matching between the feature phrases and the second multimedia file name in the second data set, filtering out and deleting the feature phrases in the second multimedia file name to obtain the target name of the multimedia file, and updating the second multimedia file name in the second data set to the target name, so as to obtain the second target data set.
Optionally, matching the first target data set against the second target data set comprises: using the target name of the multimedia file in the first target data set to look up the target name of the multimedia file in the second target data set; and matching the program identifier bound to the target name of the multimedia file in the first target data set with the client identifier bound to the target name of the multimedia file in the second data set.
According to another aspect of the present invention, a data matching device is provided. The device comprises: an acquiring unit, configured to obtain a first data set and a second data set, wherein the first data set comprises at least a preset first multimedia file name and the second data set comprises at least a second multimedia file name played at a client; a cleaning unit, configured to perform data cleaning on the first data set according to a predetermined condition to obtain a first target data set, and to perform data cleaning on the second data set according to the predetermined condition to obtain a second target data set, wherein the cleaning is used to filter feature data in the first data set and the second data set; and a first matching unit, configured to match the first target data set against the second target data set.
Optionally, the device further comprises: an establishing unit, configured to establish, before the data cleaning is performed on the first data set and the second data set according to the predetermined condition, a feature database containing the feature data used for the data cleaning, wherein the feature data comprises at least feature strings and feature keywords.
Optionally, the establishing unit comprises at least one of the following: a first establishing module, configured to detect whether the first data set and the second data set contain feature strings and feature keywords, and to add the detected feature strings and feature keywords to the feature database; and a second establishing module, configured to obtain a feature string set and/or a feature keyword set already stored in a database, and to add that set to the feature database.
Optionally, the cleaning unit comprises: a first search module, configured to search, according to the feature database, whether the first multimedia file name in the first data set and the second multimedia file name in the second data set contain a feature string and/or a feature keyword; and a deletion module, configured to delete the feature string and/or feature keyword when it is found.
Optionally, the feature data further comprises feature phrases, and the cleaning unit further comprises: a first filtering module, configured to perform regular-expression matching between the feature phrases and the first multimedia file name in the first data set, and to filter out and delete the feature phrases in the first multimedia file name to obtain a target name of the multimedia file; a first update module, configured to update the first multimedia file name in the first data set to the target name of the multimedia file, so as to obtain the first target data set; a second filtering module, configured to perform regular-expression matching between the feature phrases and the second multimedia file name in the second data set, and to filter out and delete the feature phrases in the second multimedia file name to obtain the target name of the multimedia file; and a second update module, configured to update the second multimedia file name in the second data set to the target name of the multimedia file, so as to obtain the second target data set.
Optionally, the first matching unit comprises: a second search module, configured to use the target name of the multimedia file in the first target data set to look up the target name of the multimedia file in the second target data set; and a matching module, configured to match the program identifier bound to the target name of the multimedia file in the first target data set with the client identifier bound to the target name of the multimedia file in the second data set.
In the embodiments provided by this application, data cleaning is performed on the feature data in the obtained first data set and second data set, so that the data in the cleaned first target data set and second target data set can be matched accurately. This overcomes problems in the prior art such as missed matches caused by the two parties providing inconsistent multimedia file names, avoids the low data matching accuracy of the prior-art approach, and achieves the goal of improving data matching accuracy. Furthermore, because both data sets are cleaned according to the predetermined condition, the cleaned multimedia file names are more concise, which improves data matching efficiency and saves data analysis time.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The schematic embodiments of the present invention and their description are used to explain the invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of an optional data matching method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an optional data matching device according to an embodiment of the present invention.
Detailed description
It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Embodiment 1
According to an embodiment of the present invention, a data matching method is provided. As shown in Fig. 1, the method comprises:
S102, obtaining a first data set and a second data set, wherein the first data set comprises at least a preset first multimedia file name and the second data set comprises at least a second multimedia file name played at a client;
S104, performing data cleaning on the first data set according to a predetermined condition to obtain a first target data set, and performing data cleaning on the second data set according to the predetermined condition to obtain a second target data set, wherein the cleaning is used to filter feature data in the first data set and the second data set;
S106, matching the first target data set against the second target data set.
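The flow of steps S102 to S106 can be illustrated with a minimal Python sketch. It is not the patented implementation: the feature lists, the sample names, and the function names `clean_name` and `match` are assumptions made for illustration, and the data sets are modeled as simple lists of (identifier, name) pairs.

```python
# Hypothetical feature data; the patent does not list concrete entries.
FEATURE_STRINGS = ["《", "》", "，", ",", "。", "！", "!"]
FEATURE_KEYWORDS = ["finale"]


def clean_name(name):
    """S104: delete feature strings and keywords, then normalize spaces."""
    for token in FEATURE_STRINGS + FEATURE_KEYWORDS:
        name = name.replace(token, " ")
    return " ".join(name.split())


def match(program_list, play_records):
    """S102-S106: clean both data sets and join them on the cleaned name.

    program_list: (program_id, name) pairs, i.e. the first data set.
    play_records: (client_id, name) pairs, i.e. the second data set.
    Returns a dict mapping each program_id to the set of matching client_ids.
    """
    # First target data set, indexed by cleaned name.
    program_by_name = {clean_name(n): pid for pid, n in program_list}
    result = {pid: set() for pid in program_by_name.values()}
    # Second target data set, matched record by record.
    for client_id, name in play_records:
        pid = program_by_name.get(clean_name(name))
        if pid is not None:
            result[pid].add(client_id)
    return result
```

With exact string comparison and no cleaning, neither sample name below would match its playback record; after cleaning, both do.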
Optionally, in this embodiment, the data matching method may be applied to, but is not limited to, the process in which each program team of a network platform compiles statistics on clients watching multimedia files online. For example, the program list that each program team has set in advance, which contains multimedia file names, is obtained together with the playback records that each client reports for the multimedia files its user watched. Data cleaning is then performed on the program list to obtain the original multimedia file names, and on the playback records to obtain the same original multimedia file names. Using these original names, the program identifiers of the program team are matched with the client identifiers of the clients. This overcomes the low matching accuracy in the prior art caused by inconsistent multimedia file names in the program list and the playback records, improves the accuracy of data matching, and further improves the accuracy of the analysis of users' playback behavior, so that better resources can be provided to users. The above is only an example, and this embodiment places no limitation on it.
It should be noted that the program list provided by the program team presets a binding between the program team's identifier and the first multimedia file name, and the playback record provided by the client contains a binding between the client identifier and the second multimedia file name. For example, the program list provided by a TV drama program team is shown in Table 1.
Table 1
Further, the data provided by the client is shown in Table 2.
Table 2
Optionally, in this embodiment, the feature data may be, but is not limited to being, stored in a feature database established in advance, and its types may include, but are not limited to, at least one of the following: feature strings, feature keywords, and feature phrases. Feature strings may include, but are not limited to, punctuation marks, such as the enumeration comma, comma, full stop, book-title marks, and exclamation mark, as well as full-width characters, half-width characters, Chinese characters, and English characters. Feature keywords may include, but are not limited to, keywords that identify playback progress, such as "broadcast date" and "finale". Feature phrases may include, but are not limited to, phrases that distinguish multimedia files of the same name, such as "Part 1" and "Season 1".
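One possible way to organize the three feature types is sketched below in Python. The class name `FeatureDatabase`, its fields, and every sample entry are hypothetical; the source does not specify a storage format, and an in-memory object is used here only as a stand-in for the feature database.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FeatureDatabase:
    """Hypothetical in-memory stand-in for the feature database."""
    strings: set = field(default_factory=set)    # punctuation and similar characters
    keywords: set = field(default_factory=set)   # playback-progress keywords
    phrases: list = field(default_factory=list)  # compiled patterns for same-name disambiguators


# Sample entries, chosen to mirror the categories named in the text.
db = FeatureDatabase(
    strings={"、", "，", "。", "《", "》", "！"},
    keywords={"broadcast date", "finale"},
    phrases=[re.compile(r"Part \d+"), re.compile(r"Season \d+")],
)
```

Feature phrases are stored as regular expressions rather than literal strings because a single pattern such as `Part \d+` covers every numbered part.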
Further, in this embodiment, when the feature database is established, the feature data may be obtained by, but not limited to, the following means: obtaining stored feature data sets from a database, and detecting specific features in the first data set and the second data set.
Optionally, in this embodiment, the predetermined condition may include, but is not limited to, performing data cleaning on the first data set and the second data set according to the feature data in the feature database, thereby obtaining the cleaned first target data set and second target data set. The program identifiers in the first target data set are then matched with the client identifiers in the second target data set to obtain the number of client viewers of each multimedia file.
Optionally, in this embodiment, the cleaning may include, but is not limited to, filtering and deletion. That is, superfluous information in the first multimedia file name in the program list provided by the program team, and in the second multimedia file name in the playback record provided by the client, is filtered out and deleted. This yields the original multimedia file name and avoids the low matching accuracy caused by inconsistent multimedia file names.
In the embodiment provided by this application, data cleaning is performed on the feature data in the obtained first data set and second data set, so that the data in the cleaned first target data set and second target data set can be matched accurately. This overcomes problems in the prior art such as missed matches caused by the two parties providing inconsistent multimedia file names, avoids the low data matching accuracy of the prior-art approach, and achieves the goal of improving data matching accuracy. Furthermore, because both data sets are cleaned according to the predetermined condition, the cleaned multimedia file names are more concise, which improves data matching efficiency and saves data analysis time.
As an optional scheme, before data cleaning is performed on the first data set according to the predetermined condition to obtain the first target data set and on the second data set according to the predetermined condition to obtain the second target data set, the method further comprises:
S1, establishing a feature database containing the feature data used for data cleaning, wherein the feature data comprises at least feature strings and feature keywords.
Optionally, in this embodiment, after the first data set and the second data set are obtained and before data cleaning is performed on them according to the predetermined condition, the feature database used for the data cleaning is established. The feature data in the feature database is then used to judge whether the first data set and the second data set need cleaning: if feature data is found in them, that feature data is cleaned from the first data set and the second data set.
In the embodiment provided by this application, establishing in advance a feature database containing the feature data used for data cleaning allows the first data set and the second data set to be cleaned quickly according to the feature database. This saves cleaning time and, by speeding up data cleaning, also improves the efficiency of data matching.
As an optional scheme, establishing the feature database comprises at least one of the following:
S1, detecting whether the first data set and the second data set contain feature strings and feature keywords, and adding the detected feature strings and feature keywords to the feature database;
S2, obtaining a feature string set and/or a feature keyword set already stored in a database, and adding that set to the feature database.
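The two ways of populating the feature database can be sketched as follows. This is an illustration under stated assumptions, not the source's method: the patent does not say how features are detected, so the sketch assumes that every Unicode punctuation character found in a name counts as a feature string, and the function names and the dictionary layout are hypothetical.

```python
import unicodedata


def detect_feature_strings(names):
    """S1 (assumed heuristic): collect every punctuation character in the names.

    Unicode general categories starting with "P" cover ASCII and
    full-width punctuation, including book-title marks.
    """
    return {ch for name in names for ch in name
            if unicodedata.category(ch).startswith("P")}


def build_feature_db(first_names, second_names,
                     saved_strings=(), saved_keywords=()):
    """Combine detected features (S1) with previously stored sets (S2)."""
    detected = detect_feature_strings(list(first_names) + list(second_names))
    return {
        "strings": detected | set(saved_strings),
        "keywords": set(saved_keywords),
    }
```

Keyword detection is left to the stored set here, because deciding which words are keywords needs domain knowledge that a character-class test cannot supply.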
Optionally, in this embodiment, the feature string set and/or feature keyword set already stored in the database may include, but is not limited to, a complete table of punctuation marks and a keyword data set established in advance. That is, the existing feature string set and/or feature keyword set is added directly to the feature database.
This is described with the following example. Taking the detection of whether the first data set and the second data set contain feature strings and feature keywords as an example, feature analysis is performed on the data in the first data set and the second data set, and certain special strings and/or keywords are taken as feature strings and feature keywords and added to the feature database.
In the embodiment provided by this application, feature data is obtained from the first data set and the second data set, or from a database, and is added to the feature database used for data cleaning. The feature data obtained in this way is broad and comprehensive, which ensures the accuracy of data cleaning and avoids cleaning omissions that would affect data matching.
As an optional scheme, performing data cleaning on the first data set according to the predetermined condition to obtain the first target data set, and on the second data set according to the predetermined condition to obtain the second target data set, comprises:
S1, searching, according to the feature database, whether the first multimedia file name in the first data set and the second multimedia file name in the second data set contain a feature string and/or a feature keyword;
S2, if a feature string and/or a feature keyword is found, deleting the feature string and/or the feature keyword.
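The search-then-delete steps can be sketched in Python as below. The function name `clean_with_features`, the sample name, and the choice to replace each removed feature with a space before collapsing whitespace are assumptions made for illustration.

```python
def clean_with_features(name, feature_strings, feature_keywords):
    """Search a name for features (S1) and delete the ones found (S2).

    Returns (cleaned_name, found), where found lists the removed features
    in the order they appear in the two feature collections.
    """
    candidates = list(feature_strings) + list(feature_keywords)
    found = [feature for feature in candidates if feature in name]
    for feature in found:
        # Replace with a space so adjacent words do not run together.
        name = name.replace(feature, " ")
    return " ".join(name.split()), found
```

Returning the list of removed features alongside the cleaned name makes it easy to tell which names actually needed cleaning.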
This is described with the following example. With reference to Table 1, the first multimedia file names in the first data set are searched for feature strings. For example, the first multimedia file name "Tale of Y, Part 1" shown in Table 1 is found to contain book-title marks and a comma, so the book-title marks and the comma are deleted. The first multimedia file names in the first data set are also searched for feature keywords. For example, the first multimedia file name "Record of Z Part 2 Finale" shown in Table 1 contains, in addition to the book-title marks that are feature strings, the feature keyword "finale", so this keyword must be deleted. With reference to Table 2, the data cleaning process for the second multimedia file names in the second data set is similar to that for the first data set and is not repeated here.
In the embodiment provided by this application, the feature strings and feature keywords in the feature data are used to clean the first data set and the second data set, so that superfluous feature data is filtered out and deleted. The first multimedia file name in the cleaned first data set then agrees with the second multimedia file name in the cleaned second data set, which facilitates the subsequent data matching.
As an optional scheme, the feature data further comprises feature phrases, and performing data cleaning on the first data set according to the predetermined condition to obtain the first target data set, and on the second data set according to the predetermined condition to obtain the second target data set, further comprises:
S1, performing regular-expression matching between the feature phrases and the first multimedia file name in the first data set, filtering out and deleting the feature phrases in the first multimedia file name to obtain the target name of the multimedia file, and updating the first multimedia file name in the first data set to the target name of the multimedia file, so as to obtain the first target data set;
S2, performing regular-expression matching between the feature phrases and the second multimedia file name in the second data set, filtering out and deleting the feature phrases in the second multimedia file name to obtain the target name of the multimedia file, and updating the second multimedia file name in the second data set to the target name of the multimedia file, so as to obtain the second target data set.
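The regular-expression step can be sketched with Python's standard `re` module. The patterns below are hypothetical English stand-ins for the feature phrases, and `to_target_name` and `clean_data_set` are illustrative names that do not come from the source.

```python
import re

# Hypothetical feature phrases expressed as regular expressions.
FEATURE_PHRASES = [r"Part\s+\d+", r"Season\s+\d+", r"Episode\s+\d+"]
PHRASE_RE = re.compile("|".join(FEATURE_PHRASES), flags=re.IGNORECASE)


def to_target_name(name):
    """Delete every feature phrase and collapse the leftover whitespace."""
    return " ".join(PHRASE_RE.sub(" ", name).split())


def clean_data_set(records):
    """Update each (identifier, name) record to carry the target name.

    The same function serves both data sets, so program-list entries
    and playback records end up with names in the same form.
    """
    return [(ident, to_target_name(name)) for ident, name in records]
```

Applying one shared pattern to both data sets is what makes names that differ only in part or episode markers converge on a single target name.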
Specifically be described in conjunction with following example, shown in associative list 1, utilize the feature phrase obtained in advance to carry out canonical with the first multimedia file title in the first data acquisition to mate, such as, the first multimedia file title as shown in table 1: " X legend first first collection ", wherein, comprise the feature phrase of " first ", " the first collection ", then after standardized form of Chinese charcters coupling, the feature phrase deleting above-mentioned " first ", " the first collection " will be filtered, to obtain first object data acquisition.Further, shown in associative list 2, to the canonical matching process of the second multimedia file title in the second data acquisition and the canonical matching process of the first data acquisition similar, to obtain the second target data set, the present embodiment does not repeat them here.
By the embodiment that the application provides, by utilizing the feature phrase in characteristic, canonical coupling is carried out to the first data acquisition and the second data acquisition, thus realize carrying out further data cleansing to the first multimedia file title in the first data acquisition and the second multimedia file title in the second data acquisition, so that obtain more original multimedia file title, and then reach the object of the accuracy improving coupling when Data Matching.
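Steps S1 and S2 can be illustrated with Python's standard `re` module. The patterns below are assumptions chosen for the example (English-style part, season and episode markers); the patent does not list the actual expressions, and `strip_feature_phrases` is a hypothetical name.

```python
import re

# Hypothetical feature phrases expressed as regular expressions.
FEATURE_PHRASE_PATTERNS = [
    r"Season\s*\d+",
    r"Episode\s*\d+",
    r"Part\s*\d+",
]
_PHRASE_RE = re.compile("|".join(FEATURE_PHRASE_PATTERNS), flags=re.IGNORECASE)


def strip_feature_phrases(name: str) -> str:
    """Delete regex-matched feature phrases and return the target name."""
    stripped = _PHRASE_RE.sub(" ", name)
    return " ".join(stripped.split())


def to_target_data_set(records):
    """Update each (identifier, name) record to (identifier, target name)."""
    return [(identifier, strip_feature_phrases(name)) for identifier, name in records]
```

Applying `to_target_data_set` to both data sets with the same patterns is what makes the resulting target names comparable across the program list and the play records.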
As an optional solution, matching the first target data set and the second target data set includes:
S1, looking up the target names of the multimedia files in the second target data set by the target names of the multimedia files in the first target data set;
S2, matching the program identifier bound to the target name of a multimedia file in the first target data set with the client identifier bound to the target name of the multimedia file in the second data set.
This is described with reference to the following example. The first target data set is obtained after data cleaning is performed on the first data set (for example, a program list), and the second target data set is obtained after data cleaning is performed on the second data set (for example, play records), as shown in Table 3.
Table 3
As can be seen from Table 3, because the target names of the multimedia files are consistent after cleaning, the target names of the multimedia files in the first target data set (for example, the program list) can be used to look up the target names of the multimedia files in the second target data set (for example, the play records). The program identifier bound to the target name of a multimedia file in the first target data set is then matched with the client identifier bound to the target name of the multimedia file in the second target data set, yielding the content shown in Table 4.
Table 4
Program identifier    Client identifier
TV-1                  ID-1
TV-1                  ID-1
TV-2                  ID-2
TV-3                  ID-3
In the embodiment provided by this application, the target names of the multimedia files obtained after the feature data has been cleaned are used to match the program identifiers in the first target data set with the client identifiers in the second target data set. This overcomes the inaccurate matching results caused in the prior art by inconsistent multimedia file names and improves the accuracy of data matching.
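The lookup and matching steps amount to joining the two target data sets on the cleaned target name. The sketch below assumes each record is a simple `(identifier, target_name)` tuple; the function name and the sample names are hypothetical, since the table holding the cleaned names is not reproduced in this text.

```python
from collections import defaultdict


def match_by_target_name(programs, plays):
    """Pair program identifiers with client identifiers sharing a target name.

    programs: iterable of (program_id, target_name) from the program list
    plays:    iterable of (client_id, target_name) from the play records
    """
    # Index the play records by target name so each lookup is cheap.
    clients_by_name = defaultdict(list)
    for client_id, name in plays:
        clients_by_name[name].append(client_id)

    matches = []
    for program_id, name in programs:
        for client_id in clients_by_name.get(name, []):
            matches.append((program_id, client_id))
    return matches
```

With one program-list entry per program and one play record per viewing, a client that watched the same program twice appears twice in the result, which is consistent with the repeated TV-1 / ID-1 row in Table 4.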
Further, to avoid missed matches between multimedia file names, matching may be, but is not limited to being, performed again, and the feature data used for cleaning may be, but is not limited to being, updated for the repeated matching, thereby further improving the accuracy of data matching.
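One possible reading of this repeated matching is sketched below: match once, then re-clean and re-match only the program-list entries that found no partner, using an extended feature list. The text does not say how the updated feature data is chosen, so `extra_features` and all function names are assumptions made for the illustration.

```python
def clean(name, features):
    """Delete each feature from the name and normalise whitespace."""
    for token in features:
        name = name.replace(token, " ")
    return " ".join(name.split())


def match(programs, plays, features):
    """Return (matches, unmatched_programs) after cleaning both sides."""
    clients_by_name = {}
    for client_id, name in plays:
        clients_by_name.setdefault(clean(name, features), []).append(client_id)
    matches, unmatched = [], []
    for program_id, name in programs:
        clients = clients_by_name.get(clean(name, features))
        if clients:
            matches.extend((program_id, client_id) for client_id in clients)
        else:
            unmatched.append((program_id, name))
    return matches, unmatched


def match_with_retry(programs, plays, features, extra_features):
    """Match once, then re-match the leftovers with updated feature data."""
    matches, unmatched = match(programs, plays, features)
    if unmatched and extra_features:
        updated = list(features) + list(extra_features)
        retry_matches, _ = match(unmatched, plays, updated)
        matches.extend(retry_matches)
    return matches
```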
It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
Embodiment 2
According to an embodiment of the present invention, a data matching device for implementing the above data matching method is also provided. As shown in Figure 2, the device includes:
1) an acquiring unit 202, configured to obtain a first data set and a second data set, wherein the first data set includes at least a preset first multimedia file name, and the second data set includes at least a second multimedia file name played by a client;
2) a cleaning unit 204, configured to perform data cleaning on the first data set according to a predetermined condition to obtain a first target data set, and to perform data cleaning on the second data set according to the predetermined condition to obtain a second target data set, wherein the cleaning is used to filter feature data in the first data set and the second data set;
3) a first matching unit 206, configured to match the first target data set and the second target data set.
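The division into three units can be pictured as three small cooperating objects. The sketch below only illustrates that structure; the class names mirror units 202, 204 and 206, but the in-memory lists, method names and feature list are assumptions rather than anything specified for the device.

```python
from dataclasses import dataclass, field


@dataclass
class AcquiringUnit:
    """Unit 202: supplies the two raw data sets (in-memory stand-ins here)."""
    programs: list  # (program_id, first multimedia file name)
    plays: list     # (client_id, second multimedia file name)

    def acquire(self):
        return self.programs, self.plays


@dataclass
class CleaningUnit:
    """Unit 204: filters feature data out of every name."""
    features: list = field(default_factory=list)

    def clean(self, records):
        cleaned = []
        for identifier, name in records:
            for token in self.features:
                name = name.replace(token, " ")
            cleaned.append((identifier, " ".join(name.split())))
        return cleaned


class MatchingUnit:
    """Unit 206: matches the two target data sets on the cleaned name."""

    def match(self, programs, plays):
        return [(p, c) for p, p_name in programs for c, c_name in plays if p_name == c_name]


class DataMatchingDevice:
    """Wires the three units together in the order acquire, clean, match."""

    def __init__(self, acquiring, cleaning, matching):
        self.acquiring = acquiring
        self.cleaning = cleaning
        self.matching = matching

    def run(self):
        programs, plays = self.acquiring.acquire()
        return self.matching.match(self.cleaning.clean(programs), self.cleaning.clean(plays))
```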
Optionally, in this embodiment, the above data matching method may be, but is not limited to being, applied when the program column groups of a network platform compile statistics on clients watching online multimedia files. For example, a program list containing the multimedia file names preset by each column group is obtained, together with the multimedia file play records reported by each client for the multimedia files its user watched. Data cleaning is then performed on the program list to obtain the original multimedia file names, and on the play records to obtain the same original multimedia file names. Using these original multimedia file names, the program identifiers of the column groups are matched with the client identifiers of the clients. This overcomes the low matching accuracy caused in the prior art by inconsistent multimedia file names between the program list and the play records, improves the accuracy of data matching, and further improves the accuracy of the analysis of users' playback behavior, so that better resources can be provided to users. The above is only an example, and this embodiment places no limitation on it.
It should be noted that the program list provided by a column group presets the binding relationship between the column group identifier and the first multimedia file name, and the play records provided by a client include the binding relationship between the client identifier and the second multimedia file name. For example, the TV drama column group provides the program list shown in Table 5.
Table 5
Further, the data provided by the client is shown in Table 6.
Table 6
Optionally, in this embodiment, the feature data may be, but is not limited to being, stored in a feature database set up in advance, and the types of feature data may include, but are not limited to, at least one of the following: feature strings, feature keywords, and feature phrases. Feature strings may include, but are not limited to, punctuation marks, for example the enumeration comma, comma, full stop, book-title marks, and exclamation mark, as well as, for example, full-width characters, half-width characters, Chinese characters, and English characters. Feature keywords may include, but are not limited to, keywords identifying playback progress, for example the broadcast date or "end". Feature phrases may include, but are not limited to, phrases that distinguish multimedia files of the same name, for example "first" or "first season".
Further, in this embodiment, when the feature database is set up, the feature data may be obtained in ways including, but not limited to: obtaining saved feature data sets from a database, and detecting specific features in the first data set and the second data set.
Optionally, in this embodiment, the predetermined condition may include, but is not limited to, performing data cleaning on the first data set and the second data set according to the feature data in the feature database, so as to obtain the cleaned first target data set and second target data set. The program identifiers in the first target data set are then matched with the client identifiers in the second target data set, to obtain the viewing statistics of clients for each multimedia file.
Optionally, in this embodiment, the cleaning may include, but is not limited to, filtering and deletion. That is, redundant information in the first multimedia file names in the program list provided by the column group and in the second multimedia file names in the play records provided by the client is filtered out and deleted, so as to obtain the original multimedia file names and avoid the low matching accuracy caused by inconsistent multimedia file names.
In the embodiment provided by this application, data cleaning is performed on the feature data in the obtained first data set and second data set, so that the data in the cleaned first target data set and second target data set can be matched accurately. This overcomes problems such as the missed matches caused in the prior art by inconsistent multimedia names provided by the two sides, avoids the low data matching accuracy of the prior-art matching approach, and improves data matching accuracy. Further, because data cleaning is performed on the first data set and the second data set according to the predetermined condition, the cleaned multimedia file names are more concise, which improves data matching efficiency and saves data analysis time.
As an optional solution, the device further includes:
1) an establishing unit, configured to, before data cleaning is performed on the first data set according to the predetermined condition to obtain the first target data set and on the second data set according to the predetermined condition to obtain the second target data set, set up a feature database containing the feature data used for data cleaning, wherein the feature data includes at least feature strings and feature keywords.
Optionally, in this embodiment, after the first data set and the second data set are obtained and before data cleaning is performed on them according to the predetermined condition, a feature database for the data cleaning is set up, so that the feature data in the feature database can be used to judge whether the first data set and the second data set need data cleaning. If the feature data is found, data cleaning is performed on the feature data in the first data set and the second data set.
In the embodiment provided by this application, a feature database containing the feature data used for data cleaning is set up in advance, so that data cleaning can be performed quickly on the first data set and the second data set according to the feature database. This saves data cleaning time and, while speeding up data cleaning, also improves the efficiency of data matching.
As an optional solution, the establishing unit includes at least one of the following:
1) a first establishing module, configured to detect whether the first data set and the second data set contain feature strings and feature keywords, and to add the detected feature strings and feature keywords to the feature database;
2) a second establishing module, configured to obtain a feature string set and/or a feature keyword set saved in a database, and to add the feature string set and/or the feature keyword set to the feature database.
Optionally, in this embodiment, the feature string set and/or feature keyword set saved in the database may include, but is not limited to: a complete table of punctuation marks and a keyword data set built in advance. That is, existing feature string sets and/or feature keyword sets are added directly to the feature database.
This is described with reference to the following example. Taking detection of whether the first data set and the second data set contain feature strings and feature keywords as an example, data feature analysis is performed on the first data set and the second data set, certain special strings and/or keywords are taken as feature strings and feature keywords, and they are added to the feature database.
In the embodiment provided by this application, feature data is obtained from the first data set and the second data set, or from a database, and is then added to the feature database used for data cleaning. The feature data obtained in this way is broad and comprehensive, which ensures the accuracy of data cleaning and avoids data matching being affected by omissions in cleaning.
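The two ways of populating the feature database can be sketched together. The text does not specify how the "data feature analysis" is carried out, so the sketch assumes a simple stand-in: every punctuation character found in a name (Unicode general category starting with "P") is treated as a feature string, and keywords are detected from a hypothetical candidate list. All function and parameter names are illustrative.

```python
import unicodedata


def detect_feature_strings(names):
    """Collect punctuation characters that occur in any name."""
    found = set()
    for name in names:
        for ch in name:
            # Unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po are punctuation.
            if unicodedata.category(ch).startswith("P"):
                found.add(ch)
    return found


def detect_feature_keywords(names, candidates):
    """Keep the candidate keywords that actually occur in some name."""
    return {kw for kw in candidates if any(kw in name for name in names)}


def build_feature_database(first_names, second_names, preset_strings=(),
                           preset_keywords=(), candidate_keywords=()):
    """Merge saved feature sets with features detected in both data sets."""
    names = list(first_names) + list(second_names)
    return {
        "strings": set(preset_strings) | detect_feature_strings(names),
        "keywords": set(preset_keywords) | detect_feature_keywords(names, candidate_keywords),
    }
```

Taking the union of the saved sets and the detected features reflects the stated goal of making the feature data as broad as possible before cleaning begins.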
As an optional solution, the cleaning unit includes:
1) a first lookup module, configured to search, according to the feature database, whether the first multimedia file name in the first data set and the second multimedia file name in the second data set contain feature strings and/or feature keywords;
2) a deletion module, configured to delete the feature strings and/or feature keywords when they are found.
This is described with reference to the following example. As shown in Table 5, the first multimedia file names in the first data set are searched for feature strings. For example, the first multimedia file name "Y passes, first" shown in Table 5 contains book-title marks and a comma, so the book-title marks and the comma are deleted. Further, the first multimedia file names in the first data set are searched for feature keywords. For example, the first multimedia file name "Z remembers second end" shown in Table 5 contains, in addition to the book-title marks (a feature string), the feature keyword "end", so the feature keyword "end" is deleted. Further, as shown in Table 6, the data cleaning of the second multimedia file names in the second data set is similar to that of the first data set and is not repeated in this embodiment.
In the embodiment provided by this application, data cleaning is performed on the first data set and the second data set using the feature strings and feature keywords in the feature data, so that redundant feature data is filtered out and deleted. The cleaned first multimedia file names in the first data set thus become consistent with the cleaned second multimedia file names in the second data set, which facilitates subsequent data matching.
As an optional solution, the feature data further includes feature phrases, wherein the cleaning unit further includes:
1) a first filtering module, configured to perform regular-expression matching between the feature phrases and the first multimedia file names in the first data set, and to filter out and delete the feature phrases in the first multimedia file names to obtain the target names of the multimedia files;
2) a first updating module, configured to update the first multimedia file names in the first data set to the target names of the multimedia files, to obtain the first target data set;
3) a second filtering module, configured to perform regular-expression matching between the feature phrases and the second multimedia file names in the second data set, and to filter out and delete the feature phrases in the second multimedia file names to obtain the target names of the multimedia files;
4) a second updating module, configured to update the second multimedia file names in the second data set to the target names of the multimedia files, to obtain the second target data set.
This is described with reference to the following example. As shown in Table 5, regular-expression matching is performed between the feature phrases obtained in advance and the first multimedia file names in the first data set. For example, the first multimedia file name "X legend first first collection" shown in Table 5 contains the feature phrases "first" and "the first collection"; after regular-expression matching, these feature phrases are filtered out and deleted to obtain the first target data set. Further, as shown in Table 6, the regular-expression matching of the second multimedia file names in the second data set is similar to that of the first data set and yields the second target data set; it is not repeated in this embodiment.
In the embodiment provided by this application, regular-expression matching is performed on the first data set and the second data set using the feature phrases in the feature data, so that the first multimedia file names in the first data set and the second multimedia file names in the second data set are cleaned further. Names closer to the original multimedia file names are thus obtained, which improves matching accuracy during data matching.
As an optional solution, the first matching unit includes:
1) a second lookup module, configured to look up the target names of the multimedia files in the second target data set by the target names of the multimedia files in the first target data set;
2) a matching module, configured to match the program identifier bound to the target name of a multimedia file in the first target data set with the client identifier bound to the target name of the multimedia file in the second data set.
This is described with reference to the following example. The first target data set is obtained after data cleaning is performed on the first data set (for example, a program list), and the second target data set is obtained after data cleaning is performed on the second data set (for example, play records), as shown in Table 7.
Table 7
As can be seen from Table 7, because the target names of the multimedia files are consistent after cleaning, the target names of the multimedia files in the first target data set (for example, the program list) can be used to look up the target names of the multimedia files in the second target data set (for example, the play records). The program identifier bound to the target name of a multimedia file in the first target data set is then matched with the client identifier bound to the target name of the multimedia file in the second target data set, yielding the content shown in Table 8.
Table 8
Program identifier    Client identifier
TV-1                  ID-1
TV-1                  ID-1
TV-2                  ID-2
TV-3                  ID-3
In the embodiment provided by this application, the target names of the multimedia files obtained after the feature data has been cleaned are used to match the program identifiers in the first target data set with the client identifiers in the second target data set. This overcomes the inaccurate matching results caused in the prior art by inconsistent multimedia file names and improves the accuracy of data matching.
Further, to avoid missed matches between multimedia file names, matching may be, but is not limited to being, performed again, and the feature data used for cleaning may be, but is not limited to being, updated for the repeated matching, thereby further improving the accuracy of data matching.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; or they may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A data matching method, characterized by comprising:
obtaining a first data set and a second data set, wherein the first data set includes at least a preset first multimedia file name, and the second data set includes at least a second multimedia file name played by a client;
performing data cleaning on the first data set according to a predetermined condition to obtain a first target data set, and performing data cleaning on the second data set according to the predetermined condition to obtain a second target data set, wherein the cleaning is used to filter feature data in the first data set and the second data set;
matching the first target data set and the second target data set.
2. The method according to claim 1, characterized in that, before performing data cleaning on the first data set according to the predetermined condition to obtain the first target data set and performing data cleaning on the second data set according to the predetermined condition to obtain the second target data set, the method further comprises:
setting up a feature database containing the feature data used for the data cleaning, wherein the feature data includes at least a feature string and a feature keyword.
3. The method according to claim 2, characterized in that setting up the feature database comprises at least one of the following:
detecting whether the first data set and the second data set contain the feature string and the feature keyword, and adding the detected feature string and feature keyword to the feature database;
obtaining a feature string set and/or a feature keyword set saved in a database, and adding the feature string set and/or the feature keyword set to the feature database.
4. The method according to claim 2, characterized in that performing data cleaning on the first data set according to the predetermined condition to obtain the first target data set, and performing data cleaning on the second data set according to the predetermined condition to obtain the second target data set, comprises:
searching, according to the feature database, whether the first multimedia file name in the first data set and the second multimedia file name in the second data set contain the feature string and/or the feature keyword;
if the feature string and/or the feature keyword is found, deleting the feature string and/or the feature keyword.
5. The method according to claim 4, characterized in that the feature data further includes a feature phrase, wherein performing data cleaning on the first data set according to the predetermined condition to obtain the first target data set, and performing data cleaning on the second data set according to the predetermined condition to obtain the second target data set, further comprises:
performing regular-expression matching between the feature phrase and the first multimedia file name in the first data set, filtering out and deleting the feature phrase in the first multimedia file name in the first data set to obtain a target name of the multimedia file, and updating the first multimedia file name in the first data set to the target name of the multimedia file, to obtain the first target data set;
performing regular-expression matching between the feature phrase and the second multimedia file name in the second data set, filtering out and deleting the feature phrase in the second multimedia file name in the second data set to obtain the target name of the multimedia file, and updating the second multimedia file name in the second data set to the target name of the multimedia file, to obtain the second target data set.
6. The method according to claim 5, characterized in that matching the first target data set and the second target data set comprises:
looking up the target name of the multimedia file in the second target data set by the target name of the multimedia file in the first target data set;
matching the program identifier bound to the target name of the multimedia file in the first target data set with the client identifier bound to the target name of the multimedia file in the second data set.
7. A data matching device, characterized by comprising:
an acquiring unit, configured to obtain a first data set and a second data set, wherein the first data set includes at least a preset first multimedia file name, and the second data set includes at least a second multimedia file name played by a client;
a cleaning unit, configured to perform data cleaning on the first data set according to a predetermined condition to obtain a first target data set, and to perform data cleaning on the second data set according to the predetermined condition to obtain a second target data set, wherein the cleaning is used to filter feature data in the first data set and the second data set;
a first matching unit, configured to match the first target data set and the second target data set.
8. The device according to claim 7, characterized by further comprising:
an establishing unit, configured to, before data cleaning is performed on the first data set according to the predetermined condition to obtain the first target data set and on the second data set according to the predetermined condition to obtain the second target data set, set up a feature database containing the feature data used for the data cleaning, wherein the feature data includes at least a feature string and a feature keyword.
9. The device according to claim 8, characterized in that the establishing unit comprises at least one of the following:
a first establishing module, configured to detect whether the first data set and the second data set contain the feature string and the feature keyword, and to add the detected feature string and feature keyword to the feature database;
a second establishing module, configured to obtain a feature string set and/or a feature keyword set saved in a database, and to add the feature string set and/or the feature keyword set to the feature database.
10. The device according to claim 8, characterized in that the cleaning unit comprises:
a first lookup module, configured to search, according to the feature database, whether the first multimedia file name in the first data set and the second multimedia file name in the second data set contain the feature string and/or the feature keyword;
a deletion module, configured to delete the feature string and/or the feature keyword when the feature string and/or the feature keyword is found.
11. The device according to claim 10, characterized in that the feature data further includes a feature phrase, wherein the cleaning unit further comprises:
a first filtering module, configured to perform regular-expression matching between the feature phrase and the first multimedia file name in the first data set, and to filter out and delete the feature phrase in the first multimedia file name in the first data set to obtain a target name of the multimedia file;
a first updating module, configured to update the first multimedia file name in the first data set to the target name of the multimedia file, to obtain the first target data set;
a second filtering module, configured to perform regular-expression matching between the feature phrase and the second multimedia file name in the second data set, and to filter out and delete the feature phrase in the second multimedia file name in the second data set to obtain the target name of the multimedia file;
a second updating module, configured to update the second multimedia file name in the second data set to the target name of the multimedia file, to obtain the second target data set.
12. The device according to claim 11, characterized in that the first matching unit comprises:
a second lookup module, configured to look up the target name of the multimedia file in the second target data set by the target name of the multimedia file in the first target data set;
a matching module, configured to match the program identifier bound to the target name of the multimedia file in the first target data set with the client identifier bound to the target name of the multimedia file in the second data set.
CN201410766705.4A 2014-12-11 2014-12-11 Data matching method and device Pending CN104504021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410766705.4A CN104504021A (en) 2014-12-11 2014-12-11 Data matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410766705.4A CN104504021A (en) 2014-12-11 2014-12-11 Data matching method and device

Publications (1)

Publication Number Publication Date
CN104504021A (en) 2015-04-08

Family

ID=52945419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410766705.4A Pending CN104504021A (en) 2014-12-11 2014-12-11 Data matching method and device

Country Status (1)

Country Link
CN (1) CN104504021A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372668A (en) * 2016-08-31 2017-02-01 新浪网技术(中国)有限公司 Data matching method and device
CN107193884A (en) * 2017-04-27 2017-09-22 北京小米移动软件有限公司 Method and apparatus for matching data
CN110134801A (en) * 2019-04-28 2019-08-16 福建星网视易信息***有限公司 Method for matching work titles with multimedia files, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102792298A (en) * 2010-01-13 2012-11-21 起元技术有限责任公司 Matching metadata sources using rules for characterizing matches
CN103473375A (en) * 2013-09-29 2013-12-25 方正国际软件有限公司 Data cleaning method and data cleaning system
CN104182531A (en) * 2014-08-28 2014-12-03 广州金山网络科技有限公司 Video name processing method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Jie (ed.): "Access Database and Programming, 2nd Edition" (《Access数据库与程序设计 第2版》), 31 August 2013, Tsinghua University Press *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20150408
