CN105117500B - A data query and acquisition method in a big data context - Google Patents

Info

Publication number
CN105117500B
Authority
CN
China
Prior art keywords
internet content
segment
entity
data query
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510650312.1A
Other languages
Chinese (zh)
Other versions
CN105117500A (en)
Inventor
刘洋
李雪颖
敬皓
代林
张永宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU SHINE TECHNOLOGY Co Ltd
Original Assignee
CHENGDU SHINE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU SHINE TECHNOLOGY Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and device for querying and acquiring internet content data. The method comprises the steps of: determining the target and plan for querying and acquiring internet content data in a big data context; obtaining the internet content object to be analyzed according to the determined target and plan; querying and acquiring the useful content in the internet content; computing and summarizing the entities, and determining, analyzing, and verifying the correctness of the useful internet content that was queried and acquired; and further testing the above results and revising the above method. The method and device can meet the challenges that the big data era poses for internet content, make full use of internet content information, more effectively satisfy the demand for in-depth mining of the internet content a subject is interested in, and improve the accuracy, timeliness, efficiency, and speed of internet content data query, acquisition, and mining analysis.

Description

A data query and acquisition method in a big data context
Technical Field
The present invention relates to the field of electronic data processing and, more specifically, to a method and device for querying and acquiring data in a big data context.
Background Art
With the continuing industrialization of society and the rising level of informatization, data has replaced computation as the center of information processing, and cloud computing and big data are becoming a trend. This touches many aspects, such as storage capacity, availability, I/O performance, data security, and scalability. Big data is a data set of very large scale and complexity. Big data has four Vs: Volume, the amount of data grows continuously and rapidly; Velocity, data I/O is faster; Variety, data types and sources are diverse; Value, the data has usable value in many respects.
In addition, thanks to the development of the mobile internet, the scale of internet content data is growing extremely rapidly. Research on internet content data has also become a hot topic, for example querying, mining, and acquiring desired results from internet content of interest. More specifically, internet content data contains the reporting tendencies of the media and the opinion tendencies of an increasingly large volume of information published by self-media; querying, mining, analyzing, and acquiring the internet public opinion contained in internet content data is something many subjects are interested in or urgently need. There are many methods today for querying, acquiring, mining, and analyzing information against the background of internet big data, and they can to a greater or lesser extent use that information to obtain reasonably good results. However, these methods cannot adapt well to the massive growth of data, and cannot process data accurately, promptly, efficiently, and at high speed.
To meet the challenges that the big data era poses for internet content, to make fuller use of internet content information, to more effectively satisfy the demand for in-depth mining of the internet content a subject is interested in, and to improve the accuracy, timeliness, efficiency, and speed of internet content data query, acquisition, and mining analysis, there is an urgent need in the field for an internet content data query and acquisition method that can effectively solve the above technical problems.
Summary of the Invention
One object of the present invention is to provide a method and device for querying and acquiring internet content data. The method, and the device that performs it, can meet the challenges that the big data era poses for internet content, make full use of internet content information, more effectively satisfy the demand for in-depth mining of the internet content a subject is interested in, and improve the accuracy, timeliness, efficiency, and speed of internet content data query, acquisition, and mining analysis.
The technical solution adopted by the present invention to solve the above technical problem is a method for querying and acquiring internet content data, comprising the steps of: determining the target and plan for querying and acquiring internet content data in a big data context; obtaining the internet content object to be analyzed according to the determined target and plan; querying and acquiring the useful content in the internet content; computing and summarizing the entities, and determining, analyzing, and verifying the correctness of the useful internet content that was queried and acquired; and further testing the above results and revising the above method.
According to another aspect of the present invention, querying and acquiring the useful content in the internet content comprises the following steps: dividing the internet content object into multiple segments; selecting some or all of the different segments; calculating the importance of each segment; assigning a value to the importance and ranking the segments by that value; selecting the one or more segments with the highest importance values; and capturing the important entities from them according to configured acquisition rules. Dividing the internet content object into multiple segments and calculating the importance of a segment can be done as follows: the importance of each segment is related to factors such as its position, the layout area it occupies, its font size and typeface, and the color in which it is displayed. The importance of a segment and its assigned value can also be obtained from a formula.
According to a further aspect of the present invention, a device for performing the steps of the above method is provided.
Brief Description of the Drawings
The accompanying drawings show embodiments of the present invention by way of example and not by way of limitation, in which:
Fig. 1 is a flowchart of a method for querying and acquiring internet content data in a big data context, according to an embodiment of the present invention.
Fig. 2 is a flowchart of querying the useful content in internet content, according to an embodiment of the present invention.
Detailed Description
In the following description, reference is made to the accompanying drawings, and several specific embodiments are shown by way of illustration. It should be understood that other embodiments are contemplated and may be made without departing from the scope or spirit of the present disclosure. The following detailed description is therefore not to be taken in a limiting sense.
According to an embodiment of the present invention, Fig. 1 illustrates a flowchart of a method for querying and acquiring internet content data in a big data context.
As used herein, "useful" (internet) content generally refers to content that the subject is interested in or pays attention to, or content that is related to and/or associated with content the subject is interested in, wants, expects, requires, and/or pays attention to. It must be emphasized that content which is useful to some subjects may be useless to others. The subject may be a person, an institution, or an organization, or a machine capable of performing data processing automatically, mechanically, electrically, or otherwise (such as a computer, processor, ASIC, or SoC), or a mechanism, logic, virtual device, physical device, component, piece of equipment, software, program, and so on. The above are only examples and do not limit this document or the scope of its claims to those examples. The steps of the method are described in detail below.
First, in step S1, the target and plan for querying and acquiring internet content data in a big data context are determined. Different data have different features, characteristics, and/or attributes. In the internet environment, for example, the big data of social media is based on interaction between people; the big data of military news implies or concentrates data on military weapons or military trends; the big data of social news reflects the direction of public opinion and includes the ideological tendencies of self-media publishers; and the big data of technology news for a given country, region, or research institution contains its research focus, staffing and funding allocation, output efficiency, possible range of application, and leading role in or influence on research and application fields. For these contexts, different query and acquisition requirements and plans are needed for different internet content data, which makes big data query and acquisition more targeted and accurate and lays a solid foundation for the accuracy of subsequent queries.
Second, in step S2, the internet content object to be analyzed is obtained according to the determined query and acquisition target and plan. The internet content object can be anything that contains internet content, such as, but not limited to, an internet picture with recognizable characters, a web page, or a web page picture. Preferably, the internet content object is a web page or data text saved from a web page.
Third, in step S3, the useful content in the internet content is queried and acquired. According to an embodiment of the present invention, Fig. 2 illustrates a flowchart of querying the useful content in internet content. Specifically, in step S3, querying and acquiring the useful content in the internet content comprises the following steps: S31, dividing the internet content object into multiple segments; S32, selecting some or all of the different segments; S33, calculating the importance of each segment; S34, assigning a value to the importance and ranking the segments by that value; S35, selecting the one or more segments with the highest importance values; S36, capturing the important entities from them according to configured acquisition rules. In step S34, the value can be assigned based on a threshold, for example: an importance above the threshold is assigned a value greater than zero according to some criterion, while an importance equal to or below the threshold is assigned zero. In step S35, the N segments with the highest importance values are selected as needed, where N is a positive integer. In step S36, an entity can be a character, a picture, and so on, where characters can be words, letters, phrases, long sentences, short sentences, numbers, and the like in any language. Further, if necessary, the method may also include step S37, in which the important entities are decomposed as needed. Specifically, step S37 further decomposes an entity into multiple elements and adds characters between the elements, such as !, @, #, $, %, …, &, *, (, ), the comma, [, ], /, and arbitrary Arabic numerals, and then executes step S36 again. The purpose of this arrangement is to prevent entities that have been disguised from escaping the filtering of sensitive entities.
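The threshold assignment of step S34, the top-N selection of step S35, and the separator handling of step S37 can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the function names are hypothetical, the importance itself is used as the positive value (the text only says "some criterion"), the entity is split into single characters, and a regular expression is one assumed way of tolerating inserted separators. The "…" item in the separator list is left out because the text does not spell it out.

```python
import re
from typing import List, Tuple


def assign_values(importance: List[float], threshold: float) -> List[float]:
    """S34 (assumed criterion): keep the importance itself above the threshold, else zero."""
    return [score if score > threshold else 0.0 for score in importance]


def top_n_segments(
    segments: List[str], values: List[float], n: int
) -> List[Tuple[str, float]]:
    """S34/S35: rank segments by assigned value, highest first, and keep the first n."""
    ranked = sorted(zip(segments, values), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Separator characters named for step S37, plus any Arabic digit (0-9).
SEPARATORS = r"[!@#$%&*()\[\],/0-9]*"


def separator_tolerant_pattern(entity: str) -> re.Pattern:
    """S37 (assumed reading): split the entity into characters and allow
    any run of separator characters between consecutive characters."""
    elements = [re.escape(ch) for ch in entity]
    return re.compile(SEPARATORS.join(elements))


if __name__ == "__main__":
    values = assign_values([0.9, 0.2, 0.5], threshold=0.3)
    print(top_n_segments(["headline", "footer", "body"], values, n=2))
    pattern = separator_tolerant_pattern("spam")
    print(bool(pattern.search("buy s!p@a#m now")))
```

Matching with a tolerant pattern avoids enumerating every disguised variant of an entity; generating the variants explicitly and re-running S36 on each would be an equally valid reading of the step.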
Further, in step S33, the importance of a segment can be calculated as follows. In a text-based web page, the information in each part differs, and so does its importance. Typically, the title bar is more important than the website's geographic address and number shown further down, so the importance of the latter segment is significantly lower than that of the former. Identifying segments in this way improves timeliness and efficiency and speeds up query and acquisition. The importance of each segment is therefore related to factors such as its position (important content is generally placed toward the top), the layout area it occupies (the larger the area, the higher the importance), its font size (a large font often indicates high importance), its typeface (bold text often indicates high importance), and the color in which it is displayed (red often indicates high importance). Similarly, in step S31 the internet content object can be divided into multiple segments according to the same factors.
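A layout-based importance score of this kind might look like the sketch below. The text names the factors (position, area, font size, bold, red) but gives no weights or scales, so the `Segment` fields, the normalisations, and the weights 0.3/0.3/0.2/0.1/0.1 are all assumptions chosen only so that the score stays between zero and one.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    top_offset: float  # assumed scale: 0.0 = top of page, 1.0 = bottom
    area_share: float  # assumed scale: fraction of the page area, 0.0 to 1.0
    font_size: float   # in points
    bold: bool
    red: bool


def layout_importance(seg: Segment, max_font: float = 32.0) -> float:
    """Combine the layout factors into one score in [0, 1] (weights are assumed)."""
    position = 1.0 - seg.top_offset            # nearer the top scores higher
    font = min(seg.font_size / max_font, 1.0)  # larger font scores higher, capped at 1
    return (
        0.3 * position
        + 0.3 * seg.area_share
        + 0.2 * font
        + 0.1 * float(seg.bold)
        + 0.1 * float(seg.red)
    )


if __name__ == "__main__":
    title = Segment("Site title", 0.0, 0.2, 32.0, bold=True, red=False)
    footer = Segment("Address and number", 0.95, 0.05, 10.0, bold=False, red=False)
    print(layout_importance(title) > layout_importance(footer))
```

With these assumed weights a bold title at the top of the page outranks a small footer line, which matches the title-bar example in the text.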
Alternatively, in steps S33 and S34, the importance of a segment and its assigned value can be obtained from the following formula. For the n-th segment, the importance value is Cn, which lies between zero and one [formula for Cn not reproduced in this text], where M is the number of segments; α, β, and γ are constants; A=(L-e)/(eL-e), where L is the number of entities in the segment and e is the number of distinct entities in the segment; B=mo/M, where mo is the number of associated entities in the segment; [formula for the third term not reproduced in this text] where mp is the number of entity titles and U is the number of occurrences of entity titles in the segment. Preferably, α+β+γ=1; preferably, α:β:γ=1:1:1; preferably, α:β:γ=3:2:1.
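The preferred constraint α+β+γ=1, combined with one of the preferred ratios, fixes the three constants uniquely: 1:1:1 gives 1/3 each, and 3:2:1 gives 1/2, 1/3, and 1/6. The short Python sketch below works this out with exact fractions; the helper name `weights_from_ratio` is hypothetical and only illustrates the arithmetic.

```python
from fractions import Fraction
from typing import Tuple


def weights_from_ratio(a: int, b: int, c: int) -> Tuple[Fraction, Fraction, Fraction]:
    """Scale the ratio a:b:c so that the three weights sum to exactly one."""
    total = a + b + c
    return Fraction(a, total), Fraction(b, total), Fraction(c, total)


if __name__ == "__main__":
    for ratio in [(1, 1, 1), (3, 2, 1)]:
        alpha, beta, gamma = weights_from_ratio(*ratio)
        print(ratio, alpha, beta, gamma, alpha + beta + gamma)
```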
Again, in step s 4, the entity is calculated and is summarized, determined and analyze what verification was inquired and obtained The correctness of useful internet content.This step is equivalent to result and calculates and verification step, can according to conventionally calculation, summarize and test Step is demonstrate,proved to perform, to ensure the integrality of result and correctness.
Again, in step s 5, the above results are further tested, and the above method is modified.The step The rapid part for being equivalent to feedback mechanism, the correctness obtained the purpose is to further improve inquiry.
By handling above, internet content data query acquisition methods and its device can meet the big of internet content The challenge of data age makes full use of internet content information and more effectively meets the interested internet content depth of main body and dig Pick demand, and improve accuracy, promptness, efficiency and the speed of the acquisition of internet content data query and mining analysis.
It should be understood that the examples and embodiments of the present invention can be realized in the form of hardware, software, or a combination of hardware and software. As described above, any software that performs this method can be stored in volatile or non-volatile storage, for example a storage device such as a ROM, whether erasable or rewritable or not; or in the form of memory, such as RAM, memory chips, devices, or integrated circuits; or on an optically or magnetically readable medium, such as a CD, DVD, magnetic disk, or magnetic tape. It should be understood that storage devices and storage media are examples of machine-readable storage suitable for storing one or more programs which, when executed, implement examples of the present invention. Examples of the present invention can be conveyed electronically via any medium, such as a communication signal carried over a wired or wireless connection, and the examples suitably encompass the same.
It should be noted that, because the present invention solves the technical problems discussed above and employs technical means that a person skilled in the computer and communication fields can understand from the teaching of this specification after reading it, and because it achieves the described technical effects, the solutions claimed in the appended claims are technical solutions within the meaning of patent law. In addition, because the technical solutions claimed in the appended claims can be made or used in industry, they have practical applicability.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Unless expressly stated otherwise, each feature disclosed is only one example of a generic series of equivalent or similar features. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (8)

1. A method for querying and acquiring internet content data, characterized by comprising the following steps:
S1: determining the target and plan for querying and acquiring internet content data in a big data context;
S2: obtaining the internet content object to be analyzed according to the determined query and acquisition target and plan;
S3: querying and acquiring the useful content in the internet content;
S4: computing and summarizing the entities, and determining, analyzing, and verifying the correctness of the useful internet content that was queried and acquired; and
S5: further testing the above results and revising the above method;
wherein querying and acquiring the useful content in the internet content comprises the following steps:
S31, dividing the internet content object into multiple segments;
S32, selecting some or all of the different segments;
S33, calculating the importance of each segment;
S34, assigning a value to the importance and ranking the segments by that value;
S35, selecting the one or more segments with the highest importance values; and
S36, capturing the important entities from them according to configured acquisition rules;
wherein in steps S33 and S34, the importance of a segment and its assigned value are obtained from the following formula: for the n-th segment, the importance value is Cn, which lies between zero and one [formula for Cn not reproduced in this text], where M is the number of segments; α, β, and γ are constants; A=(L-e)/(eL-e), where L is the number of entities in the segment and e is the number of distinct entities in the segment; B=mo/M, where mo is the number of associated entities in the segment; [formula for the third term not reproduced in this text] where mp is the number of entity titles and U is the number of occurrences of entity titles in the segment.
2. The method of claim 1, wherein the target and plan for querying and acquiring internet content data are determined according to the different features, characteristics, and/or attributes of different data; and the internet content object is an internet picture with recognizable characters, a web page, a web page picture, or data text saved from a web page.
3. The method of claim 1, wherein in step S34 the value is assigned to the importance based on a threshold.
4. The method of claim 1, further comprising step S37, in which an entity is further decomposed into multiple elements and characters are added between the elements, after which step S36 is executed again.
5. The method of claim 1, wherein in steps S31 and S33, dividing the internet content object into multiple segments and calculating the importance of a segment are done as follows: the importance of each segment is related to factors including its position, the layout area it occupies, its font size and typeface, and the color in which it is displayed.
6. The method of claim 1, wherein α+β+γ=1 and α:β:γ=1:1:1.
7. The method of claim 1, wherein α+β+γ=1 and α:β:γ=3:2:1.
8. A system for implementing the internet content data query and acquisition method of any one of claims 1-7, comprising devices for implementing each of the steps.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510650312.1A CN105117500B (en) 2015-10-10 2015-10-10 A kind of data query acquisition methods under big data background

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510650312.1A CN105117500B (en) 2015-10-10 2015-10-10 A kind of data query acquisition methods under big data background

Publications (2)

Publication Number Publication Date
CN105117500A CN105117500A (en) 2015-12-02
CN105117500B true CN105117500B (en) 2018-07-06

Family

ID=54665488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510650312.1A Active CN105117500B (en) 2015-10-10 2015-10-10 A kind of data query acquisition methods under big data background

Country Status (1)

Country Link
CN (1) CN105117500B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1677403A (en) * 2004-03-22 2005-10-05 微软公司 System and method for automated optimization of search result relevance
CN101131636A (en) * 2006-08-18 2008-02-27 李颖 On-line voice or Pinyin input method
CN101154228A (en) * 2006-09-27 2008-04-02 西门子公司 Partitioned pattern matching method and device thereof
CN101237465A (en) * 2007-01-30 2008-08-06 中国科学院声学研究所 A webpage context extraction method based on quick Fourier conversion
CN104503991A (en) * 2014-12-03 2015-04-08 百度在线网络技术(北京)有限公司 Information searching method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030172381A1 (en) * 2002-01-25 2003-09-11 Koninklijke Philips Electronics N.V. Digital television system having personalized addressable content

Also Published As

Publication number Publication date
CN105117500A (en) 2015-12-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A data query method in the context of big data

Effective date of registration: 20210412

Granted publication date: 20180706

Pledgee: The Agricultural Bank of Chengdu branch of Limited by Share Ltd. Chinese Sichuan

Pledgor: SHINE TECHNOLOGY Co.,Ltd.

Registration number: Y2021980002529

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20220424

Granted publication date: 20180706

Pledgee: The Agricultural Bank of Chengdu branch of Limited by Share Ltd. Chinese Sichuan

Pledgor: SHINE TECHNOLOGY Co.,Ltd.

Registration number: Y2021980002529

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A data query and acquisition method under the background of big data

Effective date of registration: 20220505

Granted publication date: 20180706

Pledgee: CHENGDU RURAL COMMERCIAL BANK CO.,LTD.

Pledgor: SHINE TECHNOLOGY Co.,Ltd.

Registration number: Y2022510000118
