CN107590125B - Big data text real-time interaction method and device based on a random algorithm - Google Patents

Big data text real-time interaction method and device based on a random algorithm

Info

Publication number
CN107590125B
Authority
CN
China
Prior art keywords
text
data
big data
big
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710802384.2A
Other languages
Chinese (zh)
Other versions
CN107590125A (en)
Inventor
管荑
田大伟
王启龙
李鸿奎
刘春秀
高军
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a big data text real-time interaction method and device based on a random algorithm. Large text data to be processed is preprocessed using a random algorithm; the data source of the big data text is loaded and real-time interaction field options for the big data text are generated; based on the generated field options, the big data text content is adjusted, completing the real-time interaction, and semantic query analysis is performed on the big data text. Compared with existing text big data analysis frameworks, the present invention performs data preprocessing and data analysis at the same time, so that in practical applications it can respond quickly to interaction requests, including estimating the number of data rows (records) fairly accurately and quickly locating and displaying data blocks in the interactive interface.

Description

Big data text real-time interaction method and device based on a random algorithm
Technical field
The invention belongs to the technical field of big data analysis and is used for interactive, real-time analysis of big data text content. Specifically, it relates to a big data text real-time interaction method and device based on a random algorithm.
Background technique
Big data is characterized by huge volume, fast growth, great variability, and high value. For a specific industry or a specific application, data that is complete and detailed, or that is dense and local, is highly useful.
Big data processing is in fact an extension of earlier data processing to a larger scale. Technologies such as acquisition, parallel crawlers, storage, and cloud platforms make various kinds of statistics possible; once the data reaches sufficient scale, the statistics become meaningful. The statistical methods are based on large-scale data mining, including clustering, classification, prediction, and analysis, and the final result is presented as a visualization, that is, shown with various charts. Natural language processing is also involved: whenever the meaning of the data has to be analyzed, natural language processing techniques are needed.
The fundamental significance of text big data analysis lies in extracting valuable information from data. For structured data in particular, the industry has accumulated a great deal of experience and already has considerable technical reserves for data acquisition, storage, processing, and retrieval. However, existing text big data analysis frameworks and schemes do not provide an interactive framework capable of real-time analysis. Text analysis of big data also lacks more efficient and fine-grained analysis means, and users cannot obtain optimized text data content from big data text resources.
Summary of the invention
The present invention is proposed to solve the problems mentioned above. Its purpose is to provide a big data text real-time interaction method, device, and system based on a random algorithm that can optimize the analysis of big data text and carry out, together with the user, semantic analysis based on natural language processing.
The purpose of the present invention is not limited to this; other purposes not mentioned here will be clearly understood by those skilled in the art from the description below.
The present invention first protects a big data text real-time interaction method based on a random algorithm, characterized in that it comprises:
Step 1: preprocess the large text data to be processed using a random algorithm;
Step 2: load the data source of the big data text and generate real-time interaction field options for the big data text;
Step 3: based on the generated real-time interaction field options, adjust the big data text content, complete the real-time interaction with the big data text, and perform semantic query analysis on the big data text.
Preferably, step 1 specifically includes: establishing a standard comparison data text for text clustering; completing text preprocessing operations, including text word segmentation and stop-word removal; dividing the text big data into N data blocks using the random algorithm; each sub-thread saving the scanned line numbers and coordinates into index data; and the main thread then reading the data sequentially according to the index data.
Preferably, step 2 specifically includes: selecting the data source of the big data text to be loaded; obtaining the loaded big data text data source; and extracting from it big data text field items such as data loading progress, file size, file line count, current line number, and field count.
Preferably, step 3 specifically includes: the user adjusting the big data text field items shown in the interface, such as data loading progress, file size, file line count, current line number, and field count; standardizing the semantic structure of the big data text; building a semantic normalization model of the query text structures involved in the big data text query analysis process; and establishing a command parsing and query business flow model, with query process control and result feedback.
Meanwhile, the present invention also protects a big data text real-time interaction device based on a random algorithm, characterized in that it comprises:
Preprocessing module: preprocesses the large text data to be processed using a random algorithm;
Text data loading module: loads the data source of the big data text and generates real-time interaction field options for the big data text;
Text data interaction module: based on the generated real-time interaction field options, adjusts the big data text content, completes the real-time interaction with the big data text, and performs semantic query analysis on the big data text.
Preferably, the preprocessing module is further configured to: first perform cluster preprocessing based on the big data text content; establish a standard comparison data text for text clustering; complete text preprocessing operations including text word segmentation and stop-word removal; and divide the text big data into N data blocks using the random algorithm, where each sub-thread saves the scanned line numbers and coordinates into index data and the main thread then reads the data sequentially according to the index data.
Preferably, the text data loading module is further configured to: select the data source of the big data text to be loaded; load the data source of the big data text; and obtain the loaded big data text data source, extracting from it big data text field items such as data loading progress, file size, file line count, current line number, and field count.
Preferably, in the text data interaction module, the user adjusts the big data text field items shown in the interface, such as data loading progress, file size, file line count, current line number, and field count; the text data interaction module standardizes the semantic structure of the big data text, builds a semantic normalization model of the query text structures involved in the big data text query analysis process, and establishes a command parsing and query business flow model, with query process control and result feedback.
In addition, the present invention also protects a big data text real-time interaction system based on a random algorithm, for implementing the big data text real-time interaction method based on a random algorithm described above, characterized in that it comprises:
Local client: stores the local big data text data source and exchanges data with the local text data interaction system platform;
Server end: runs the local data interaction system platform and receives the local big data text data source or a remotely transmitted big data text data source;
Network exchange platform: uses the File Transfer Protocol and builds a webservice to complete the transmission of the big data text data source.
The analysis interface in the text big data analysis interactive interface design of the present invention contains five key items of data analysis information: data loading progress, file size, file line count, current line number, and field count. The idea of a random algorithm is used to estimate the total number of lines in the file fairly accurately, and a multithreading mechanism is used to create the index file.
Compared with existing text big data analysis frameworks, the present invention performs data preprocessing and data analysis at the same time, so that in practical applications it can respond quickly to interaction requests, including estimating the number of data rows (records) fairly accurately and quickly locating and displaying data blocks in the interactive interface.
Brief description of the drawings
The accompanying drawings, which are included to provide a further understanding of the disclosed subject matter, are incorporated into and constitute a part of this specification. The drawings also illustrate implementations of the disclosed subject matter and, together with the detailed description, serve to explain its principles. No attempt is made to show structural details beyond what is needed for a basic understanding of the disclosed subject matter and the various ways in which it may be practiced.
Fig. 1 is the text big data interactive interface display diagram involved in the method.
Fig. 2 is the text big data preprocessing structure diagram involved in the method.
Fig. 3 is the business flow diagram involved in the method.
Fig. 4 is the system architecture diagram involved in the system.
Specific embodiment
The advantages and features of the present invention, and the methods for achieving them, will become clear from the accompanying drawings and the detailed description that follows.
The present invention first relates to a big data text real-time interaction method based on a random algorithm, which can optimize the analysis of big data text and carry out, together with the user, semantic analysis based on natural language processing. The method identifies the data source of the big data text, and predicts and extracts the content and structure of the text data of data sources that meet the conditions, for subsequent interactive use of the text data.
The method mainly includes three steps. Referring to Fig. 3, the work flow diagram of the method, they are:
Step 1: preprocess the large text data to be processed using a random algorithm.
Step 2: load the data source of the big data text and generate real-time interaction field options for the big data text.
Step 3: based on the generated real-time interaction field options, adjust the big data text content, complete the real-time interaction with the big data text, and perform semantic query analysis on the big data text.
The main content of the method is to complete the interactive use of data: semantic analysis by the user is carried out on the big data text content, an accurate text structure framework is extracted from it, the text is split based on the field options, and the user then further adjusts and modifies the big data text.
Preferably, in step 1, preprocessing the large text data to be processed using a random algorithm includes:
Step 1.1: first perform cluster preprocessing based on the big data text content: establish a standard comparison data text for text clustering, and complete text preprocessing operations including text word segmentation and stop-word removal.
The standard comparison data text is randomly divided into experiment text and debugging text, and a text data representation model of the standard comparison data text is established using a vector space model.
Rare feature selection is performed on the standard comparison data text according to the above text data representation model: after a value has been calculated for every feature word in the standard comparison data text, the words are sorted in descending order and all rare words above a second threshold are extracted, generating a rare vocabulary.
The standard comparison data text is trained using a probabilistic classifier to obtain a clustering result. The experiment text and the debugging text are vectorized against the rare vocabulary extracted earlier: for each file, the word frequency and the inverse document frequency of each word contained in the rare vocabulary are found, the product word frequency × inverse document frequency is calculated for each word, and the result is saved.
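The word frequency × inverse document frequency calculation described above can be sketched as follows. This is a minimal illustration, assuming the files are already segmented into tokens and the rare vocabulary is given; the function name `tf_idf_for_vocabulary` and the logarithmic form of the inverse document frequency are assumptions, since the text does not state the exact formula.

```python
import math
from collections import Counter


def tf_idf_for_vocabulary(documents, vocabulary):
    """Return one {word: tf * idf} dict per document, restricted to `vocabulary`.

    `documents` is a list of token lists (already segmented, stop words removed).
    The IDF form log(N / df) is an assumption; the source only names the
    product "word frequency x inverse document frequency".
    """
    vocab = set(vocabulary)
    n_docs = len(documents)

    # Document frequency: in how many documents each vocabulary word occurs.
    doc_freq = Counter()
    for tokens in documents:
        for word in set(tokens) & vocab:
            doc_freq[word] += 1

    results = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {}
        for word in vocab:
            if counts[word] == 0:
                continue  # word absent from this document
            tf = counts[word] / total
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = tf * idf
        results.append(scores)
    return results
```

A word that occurs in every file gets a score of zero under this form, which is one reason a rare vocabulary is a natural fit for the vectorization step.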
Step 1.2: according to the computing resources of the current system, initialize the corresponding number of threads N (N > 1), comprising one main thread and N-1 sub-threads, and then divide the text big data into N data blocks using the random algorithm, where the size of the text big data is S.
The system initializes the main thread and sub-threads according to the computing resources of the current system, which mainly includes: judging whether the current system operating rate is greater than or equal to a third threshold; if so, initializing the main thread and sub-threads; otherwise, waiting for the system rate to change until it is greater than or equal to the third threshold; and/or judging whether the current system operating load is less than or equal to a fourth threshold; if so, initializing the main thread and sub-threads; otherwise, waiting for the system load to change until it is less than or equal to the fourth threshold.
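The threshold check before thread initialization might look like the sketch below. It is illustrative only: `read_rate` and `read_load` are hypothetical callables standing in for whatever system metrics are available, and the sketch requires both conditions at once, whereas the text allows either or both.

```python
import time


def wait_until_ready(read_rate, read_load, rate_threshold, load_threshold,
                     poll_seconds=0.0, max_polls=100):
    """Poll until rate >= rate_threshold and load <= load_threshold.

    `read_rate` and `read_load` are hypothetical metric providers.
    Returns True when the threads may be initialized, or False if the
    conditions were still not met after `max_polls` checks.
    """
    for _ in range(max_polls):
        if read_rate() >= rate_threshold and read_load() <= load_threshold:
            return True
        time.sleep(poll_seconds)  # wait for the system state to change
    return False
```

The `max_polls` bound is a design choice of the sketch so that a caller is never blocked forever; the source describes an unbounded wait.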
Referring to Fig. 2, the big data text is divided evenly into several text blocks (four in the example of Fig. 2; the number can be chosen freely according to the actual situation), and threads one to four are initialized according to the divided data blocks. By default, thread one is the main thread; in addition to the basic reading of text block one, it is responsible for maintaining the interaction between threads. Threads two to four are sub-threads, each responsible for the basic reading of its own text block.
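The even division shown in Fig. 2 can be sketched as a computation of byte ranges, one per thread. The helper name `block_ranges` is an assumption; the start of block i (counting from 1) follows the (i-1)*S/N position used in the text, rounded down to a whole byte.

```python
def block_ranges(total_size, n_threads):
    """Split [0, total_size) into n_threads contiguous (start, end) byte ranges.

    Block i (1-based) starts at (i - 1) * S // N, matching the start
    position given in the description; the last block ends at S.
    """
    if n_threads < 2:
        raise ValueError("the description requires N > 1 threads")
    starts = [(i * total_size) // n_threads for i in range(n_threads)]
    ends = starts[1:] + [total_size]
    return list(zip(starts, ends))
```

These boundaries fall at arbitrary byte positions, so a block may begin in the middle of a row; any reader working on a block has to skip forward to the next row boundary first.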
Step 1.3: the i-th thread scans row (record) coordinates starting from position (i-1)*S/N; the main thread randomly selects and summarizes the sizes of N rows (records) scanned by the sub-threads and estimates the number of lines in the text big data.
After segmentation, each text block is read and scanned by its corresponding sub-thread. While the data inside the text block is being read, the number of lines in the text block is estimated, and from this the number of lines in the whole big data text is estimated.
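One way to read the estimate in step 1.3 is: scan a few rows at the start of each block, randomly pick some of the scanned rows, average their sizes, and divide the total size S by that average. The sketch below follows this reading; the function names, the `rows_per_block` and `sample_size` parameters, and the sequential loop (instead of real threads) are all assumptions made for brevity.

```python
import random


def scan_rows(data, start, max_rows):
    """Return (offset, length) for up to max_rows complete rows at or after start."""
    if start > 0:
        # Skip the partial row that the block boundary may have cut through.
        nl = data.find(b"\n", start - 1)
        if nl == -1:
            return []
        start = nl + 1
    rows = []
    while start < len(data) and len(rows) < max_rows:
        nl = data.find(b"\n", start)
        end = len(data) if nl == -1 else nl + 1
        rows.append((start, end - start))
        start = end
    return rows


def estimate_row_count(data, n_blocks, rows_per_block, sample_size, seed=0):
    """Estimate the total row count as S divided by the mean sampled row size."""
    size = len(data)
    scanned = []
    for i in range(n_blocks):  # each iteration stands in for one thread
        scanned.extend(scan_rows(data, (i * size) // n_blocks, rows_per_block))
    if not scanned:
        return 0
    rng = random.Random(seed)
    sample = rng.sample(scanned, min(sample_size, len(scanned)))
    mean_len = sum(length for _, length in sample) / len(sample)
    return round(size / mean_len)
```

The estimate is only as good as the sample: if row sizes vary strongly across the file, scanning more rows per block tightens it at the cost of more reading.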
Step 1.4: each sub-thread saves the scanned line numbers and coordinates into index data, and the main thread then reads the data sequentially according to the index data.
The system establishes index data from the row coordinates of the text blocks read by each thread: each sub-thread establishes index data for the row coordinates of its own text block, and the main thread, besides establishing index data for the row coordinates of its own text block, also needs to concatenate the respective index data in order and read it, forming the big data text content.
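The per-block indexing and ordered merge described above can be sketched with a thread pool. This is an assumption-laden illustration: it uses Python's `concurrent.futures.ThreadPoolExecutor` rather than a dedicated main thread plus N-1 sub-threads, keeps the data in memory, and the names `index_block`, `build_index`, and `read_row` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def index_block(data, start, end):
    """Return the byte offsets of all rows that begin inside [start, end)."""
    if start > 0:
        # Move to the first row boundary at or after the block start.
        nl = data.find(b"\n", start - 1)
        pos = len(data) if nl == -1 else nl + 1
    else:
        pos = 0
    offsets = []
    while pos < end and pos < len(data):
        offsets.append(pos)
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    return offsets


def build_index(data, n_threads):
    """Index every block in its own worker, then join the parts in block order."""
    size = len(data)
    bounds = [(i * size) // n_threads for i in range(n_threads)] + [size]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda i: index_block(data, bounds[i], bounds[i + 1]),
                         range(n_threads))
        # map() yields results in block order, so the merged list is sorted.
        return [offset for part in parts for offset in part]


def read_row(data, index, row_no):
    """Read row `row_no` (0-based) using the merged index."""
    start = index[row_no]
    end = index[row_no + 1] if row_no + 1 < len(index) else len(data)
    return data[start:end].rstrip(b"\n")
```

Because each row is indexed by exactly one block (the one in which it begins), the merged list position doubles as the line number and the stored value is the row coordinate.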
Preferably, step 2, loading the data source of the big data text and generating real-time interaction field options for the big data text, includes:
Step 2.1: select the data source of the big data text to be loaded, where the data source of the big data text includes a local big data text data source and/or a remote cloud big data text data source.
The local data source may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory, programmable read-only memory, or erasable programmable read-only memory. The volatile memory may be random access memory, which is used as an external cache. The remote cloud big data text data source is mainly a cloud storage device built on a cloud platform architecture.
Step 2.2: load the data source of the big data text. If it is a local big data text data source, look it up and read it according to its local storage path; if it is a remote cloud big data text data source, first locate the remote web server of the text data source, then establish a webservice network service and transmit the big data text data source.
Step 2.3: obtain the loaded big data text data source and extract from it big data text field items such as data loading progress, file size, file line count, current line number, and field count.
Referring to Fig. 1, the text data extraction interface of the system: in this interface, edit, view, and help operations can be performed on the text big data. The user selects a data source, local or remote, as needed; after the user clicks to load the data, the basic content of the text data is displayed in rows and columns, and statistical analysis data is shown below it, including file size, file line count, field count, and current line number. The statistical analysis data shown below is computed using the random algorithm.
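The five field items shown in the interface can be gathered into a small record, as in the sketch below. It is a simplified illustration: the `FieldOptions` class, the `summarize` function, and the comma delimiter are assumptions, and the line count here is an exact count over the loaded portion rather than the random-algorithm estimate the text describes.

```python
from dataclasses import dataclass


@dataclass
class FieldOptions:
    load_progress: float  # fraction of bytes loaded so far, 0.0 to 1.0
    file_size: int        # total size in bytes
    line_count: int       # rows seen in the loaded portion
    current_line: int     # row currently shown in the interface
    field_count: int      # fields in the first row


def summarize(data, loaded_bytes, current_line, delimiter=b","):
    """Build the field items for the part of `data` loaded so far."""
    lines = data[:loaded_bytes].splitlines()
    first = lines[0] if lines else b""
    return FieldOptions(
        load_progress=loaded_bytes / len(data) if data else 1.0,
        file_size=len(data),
        line_count=len(lines),
        current_line=current_line,
        field_count=len(first.split(delimiter)) if first else 0,
    )
```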
Preferably, if it is a remote cloud big data text data source, first locating the remote web server of the text data source, then establishing a webservice network service and transmitting the big data text data source includes:
Step 2.2.1: when transmitting a remote big data text data source, the current operating state of the system must be judged; when the operating energy consumption ratio of the system exceeds a first threshold, the remote transmission of the big data text data source is stopped.
The user mainly needs to judge the current system's operating energy consumption ratio, where the energy consumption ratio is mainly a balanced value of the current system's operating rate and operating load.
Step 2.2.2: after a period of time, if the operating energy consumption ratio of the system no longer exceeds the first threshold, start the data source breakpoint transfer mechanism.
Step 2.2.3: in response to the user's request to transmit a big data text data source, identify the big data text identification information of the data source to be obtained.
Step 2.2.4: based on the big data text identification information of the data source to be obtained, determine whether the big data text data source to be sent is a breakpoint-restart data source, that is, a data source comprising multiple big data file data source blocks, of which some have been uploaded successfully and the rest have not completed uploading.
If the big data text data source to be sent is a breakpoint-restart data source as described above, some of its big data file data source blocks have already been uploaded to the server, while the remaining blocks have not yet been uploaded. To continue uploading the unfinished part, it must first be determined which part of the big data text data source to be sent has been uploaded and which part has not; the breakpoint mark describes this location information. In some scenarios, the big data file data source blocks of the data source to be sent can be treated as a sequence, in which the blocks before the breakpoint mark are regarded as already uploaded and the remaining blocks as not yet uploaded.
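The stop-and-resume behavior with a breakpoint mark can be sketched as below. It is a minimal illustration, assuming the blocks form an ordered sequence and the breakpoint mark is simply the index of the first block not yet uploaded; `send` and `energy_ratio` are hypothetical callables, since the source does not define how the energy consumption ratio is measured.

```python
def transfer_with_pause(blocks, send, energy_ratio, first_threshold, resume_from=0):
    """Upload blocks[resume_from:] in order, stopping when the ratio is too high.

    Returns the breakpoint mark: the index of the next block to upload.
    A return value equal to len(blocks) means the upload is complete.
    """
    position = resume_from
    while position < len(blocks):
        if energy_ratio() > first_threshold:
            break  # stop the remote transmission; resume later from `position`
        send(blocks[position])
        position += 1
    return position
```

A caller would store the returned mark, wait until the ratio is back at or below the first threshold, and call the function again with `resume_from` set to that mark.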
Step 2.2.5: if the big data text data source to be sent is not a breakpoint-restart data source, cut the big data text data source to be sent into multiple data source blocks and upload each data source block of the big data text data source to the server.
If the big data text data source to be sent is not a breakpoint-restart data source, a feature code in one-to-one correspondence with the data source is created based on its attribute information, and, based on the match between each data source block and the feature code, each data source block of the big data text data source to be sent is stored in a predetermined storage area. Here, the feature code may, for example, be an identifier that corresponds one-to-one with the big data text data source to be sent and characterizes the identity of the file to be uploaded. If a data source block received by the server matches the feature code, this indicates that the block belongs to the big data text data source to be sent, and the block can be stored in the designated storage area on the server.
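The feature code matching described above can be sketched as follows. The choice of attributes (name, size, modification time), the use of SHA-256 from Python's `hashlib`, and the `BlockStore` class standing in for the server's storage area are all assumptions; the source only says the code is derived from attribute information and identifies the file.

```python
import hashlib


def make_feature_code(name, size, modified):
    """Derive a feature code from attribute information (assumed attributes)."""
    raw = f"{name}|{size}|{modified}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def split_into_blocks(data, block_size, feature_code):
    """Cut the data source into blocks, each tagged with the feature code."""
    return [
        {"feature_code": feature_code, "seq": i, "payload": data[off:off + block_size]}
        for i, off in enumerate(range(0, len(data), block_size))
    ]


class BlockStore:
    """Hypothetical stand-in for the predetermined storage area on the server."""

    def __init__(self, expected_code):
        self.expected_code = expected_code
        self.blocks = {}

    def receive(self, block):
        """Store the block only if its feature code matches; report the outcome."""
        if block["feature_code"] != self.expected_code:
            return False
        self.blocks[block["seq"]] = block["payload"]
        return True

    def assemble(self):
        """Join the stored blocks in sequence order."""
        return b"".join(self.blocks[i] for i in sorted(self.blocks))
```

Keeping a sequence number on every block also gives the server what it needs to work out a breakpoint mark later: the first missing sequence number.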
Preferably, step 3, adjusting the big data text content based on the generated real-time interaction field options, completing the real-time interaction, and performing semantic query analysis on the big data text, includes:
Step 3.1: the user adjusts the big data text field items shown in the interface, such as data loading progress, file size, file line count, current line number, and field count, taking the runtime performance of the system into account, so that the system's operating energy consumption ratio after adjustment is not higher than the first threshold;
Step 3.2: standardize the semantic structure of the big data text and build a semantic normalization model of the query text structures involved in the big data text query analysis process; the semantic description degree of each text structure is obtained by setting multi-level semantic description standards. The semantic normalization model includes a text content normative model TCRM, a text query ordering normative model SCRM, a text query mode normative model SMRM, and a text query business flow normative model FWRM;
Step 3.3: establish a command parsing and query business flow model, with query process control and result feedback.
Query mode models are filtered and, combined with the historical query mode models of successful queries, the query method models that meet the needs of each link of the business flow are selected. The candidate query mode models of each link are constructed and combined into strategies and rules for querying commands correctly.
Query confidence is assessed and a credibility evaluation system is established: each query method model is evaluated through the query history to obtain its confidence for different types of commands. A query business chain is built, composed of the query method models used, according to the query business flow. The credibility of the result of each business flow stage on the query business chain is calculated. The overall confidence of the complete chain is calculated, the complete business chains are ranked by overall confidence, and the result with the highest confidence is fed back to the user.
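The ranking of query business chains by overall confidence can be sketched as below. The source does not say how stage confidences are combined; taking their product is an assumption made here, and the names `chain_confidence` and `best_chain` are hypothetical.

```python
from math import prod


def chain_confidence(stage_confidences):
    """Overall chain confidence, assumed here to be the product of its stages."""
    return prod(stage_confidences)


def best_chain(chains):
    """Rank chains by overall confidence and return (name, confidence) of the top one.

    `chains` maps a chain name to the list of its per-stage confidences.
    """
    ranked = sorted(chains.items(),
                    key=lambda item: chain_confidence(item[1]),
                    reverse=True)
    name, stages = ranked[0]
    return name, chain_confidence(stages)
```

With a product, one weak stage pulls the whole chain down, which suits a pipeline in which every stage must succeed; other combinations, such as the minimum stage confidence, would rank chains differently.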
Preferably, before step 1, preprocessing the large text data to be processed using a random algorithm, the method further includes:
Step 0: set up data synchronization units inside the system. According to the data synchronization modes that may be adopted, multiple data synchronization units are set up inside the system, for example for full update synchronization or incremental update synchronization.
The synchronization modes adopted by the data synchronization units may also include the following:
1) data synchronization based on the underlying database;
2) data synchronization initiated using XML files as the intermediate access medium;
3) data synchronization initiated using the POP3 protocol;
4) data synchronization initiated using Webservice proxy calls.
When a new data synchronization mode is found and the system may use it, a data synchronization unit supporting that mode needs to be added to the data synchronization units. Likewise, if it is found that the system will no longer use an existing data synchronization mode, the data synchronization unit supporting that mode can be deleted from the data synchronization units.
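Adding, removing, and looking up synchronization units by mode can be sketched as a small registry. This is an illustrative structure only: the `SyncRegistry` class and the representation of a unit as a plain callable are assumptions, not something the source specifies.

```python
class SyncRegistry:
    """Maps a synchronization mode name to the unit (a callable) that supports it."""

    def __init__(self):
        self._units = {}

    def register(self, mode, unit):
        """Add a unit when the system starts to use a new synchronization mode."""
        self._units[mode] = unit

    def unregister(self, mode):
        """Remove the unit for a mode the system no longer uses."""
        self._units.pop(mode, None)

    def synchronize(self, mode, data):
        """Look up the unit that supports `mode` and hand it the data to synchronize."""
        unit = self._units.get(mode)
        if unit is None:
            raise LookupError(f"no data synchronization unit supports mode {mode!r}")
        return unit(data)
```

For example, a full update unit could be registered as `registry.register("full", lambda rows: list(rows))`, with an incremental unit registered alongside it under another mode name.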
Preferably, after step 3, adjusting the big data text content based on the generated real-time interaction field options, completing the real-time interaction, and performing semantic query analysis on the big data text, the method further includes:
Step 4: the user's local or remote data source initiates a data synchronization request. For the data to be synchronized, the user's local or remote data source queries, according to the data synchronization mode used by the system, for the data synchronization unit that supports that mode, and sends the data to be synchronized to the data synchronization unit found; the user's local or remote data source sends the data synchronization request to the system according to the corresponding protocol.
Step 5: after the user has adjusted the big data text content and performed semantic query analysis on the big data text, the local or remote data is synchronized, so that the local or remote data is updated in step with the content resulting from the user's adjustments and queries.
The present invention also claims a big data text real-time interaction device based on a random algorithm, characterized in that it comprises: a preprocessing module, which preprocesses the large text data to be processed using a random algorithm; a text data loading module, which loads the data source of the big data text and generates real-time interaction field options for the big data text; and a text data interaction module, which, based on the generated real-time interaction field options, adjusts the big data text content, completes the real-time interaction with the big data text, and performs semantic query analysis on the big data text.
Preferably, the preprocessing module preprocessing the large text data to be processed using a random algorithm includes:
The preprocessing module first performs cluster preprocessing based on the big data text content, establishes a standard comparison data text for text clustering, and completes text preprocessing operations including text word segmentation and stop-word removal. According to the computing resources of the current system, it initializes the corresponding number of threads N (N > 1), comprising one main thread and N-1 sub-threads, and then divides the text big data into N data blocks using the random algorithm, where the size of the text big data is S. The i-th thread scans row (record) coordinates starting from position (i-1)*S/N; the main thread randomly selects and summarizes the sizes of N rows (records) scanned by the sub-threads and estimates the number of lines in the text big data. Each sub-thread saves the scanned line numbers and coordinates into index data, and the main thread then reads the data sequentially according to the index data.
Preferably, the text data loading module loading the data source of the big data text and generating real-time interaction field options for the big data text includes:
The text data loading module selects the data source of the big data text to be loaded, where the data source of the big data text includes a local big data text data source and/or a remote cloud big data text data source. The text data loading module loads the data source of the big data text: if it is a local big data text data source, the module looks it up and reads it according to its local storage path; if it is a remote cloud big data text data source, the module first locates the remote web server of the text data source, then establishes a webservice network service and transmits the big data text data source. The text data loading module obtains the loaded big data text data source and extracts from it big data text field items such as data loading progress, file size, file line count, current line number, and field count.
If it is a remote cloud big data text data source, first locating the remote web server of the text data source, then establishing a webservice network service and transmitting the big data text data source includes:
When transmitting a remote big data text data source, the text data loading module must judge the current operating state of the system; when the operating energy consumption ratio of the system exceeds the first threshold, the remote transmission of the big data text data source is stopped. After a period of time, if the operating energy consumption ratio of the system no longer exceeds the first threshold, the data source breakpoint transfer mechanism is started. In response to the user's request to transmit a big data text data source, the big data text identification information of the data source to be obtained is identified. Based on this identification information, it is determined whether the big data text data source to be sent is a breakpoint-restart data source, that is, a data source comprising multiple big data file data source blocks, of which some have been uploaded successfully and the rest have not completed uploading. If the big data text data source to be sent is not a breakpoint-restart data source, it is cut into multiple data source blocks, and each data source block of the big data text data source is uploaded to the server.
Preferably, the text data interaction module adjusting the big data text content based on the generated real-time interaction field options, completing the real-time interaction, and performing semantic query analysis on the big data text includes:
In the text data interaction module, the user adjusts the big data text field items shown in the interface, such as data loading progress, file size, file line count, current line number, and field count, taking the runtime performance of the system into account, so that the system's operating energy consumption ratio after adjustment is not higher than the first threshold. The text data interaction module standardizes the semantic structure of the big data text and builds a semantic normalization model of the query text structures involved in the big data text query analysis process; the semantic description degree of each text structure is obtained by setting multi-level semantic description standards. The semantic normalization model includes a text content normative model TCRM, a text query ordering normative model SCRM, a text query mode normative model SMRM, and a text query business flow normative model FWRM. The text data interaction module establishes a command parsing and query business flow model, with query process control and result feedback.
Preferably, before the preprocessing module preprocesses the big text data to be processed using the random algorithm, the method further includes:
The preprocessing module sets up data synchronization devices inside the system. Multiple data synchronization devices are provided, one for each data synchronization mode that may be used, such as full-update synchronization or incremental-update synchronization.
Preferably, after the text data interaction module adjusts the big data text content based on the generated big data text real-time interaction field options, completes the big data text real-time interaction, and performs big data text semantic query analysis, the method further includes:
The user's local or remote data source initiates a data synchronization request. For the data to be synchronized, the user's local or remote data source queries, according to the data synchronization mode used by the system, for a data synchronization device that supports that mode and sends the data to be synchronized to the device found; the user's local or remote data source also sends the data synchronization request protocol to the system. After the user has adjusted the big data text content and performed semantic query analysis on the big data text, the local or remote data synchronization is completed: the local or remote data is updated to match the content resulting from the user's adjustments and queries.
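The lookup of a data synchronization device by synchronization mode can be sketched as a simple registry. This is a hedged illustration only: the `SYNC_DEVICES` registry, the mode names `"full"` and `"incremental"`, and the use of dictionaries as the data being synchronized are assumptions introduced for the example.

```python
def full_sync(source: dict, target: dict) -> int:
    """Full-update synchronization: replace the target with the source."""
    target.clear()
    target.update(source)
    return len(source)  # number of records written


def incremental_sync(source: dict, target: dict) -> int:
    """Incremental-update synchronization: apply only the differences."""
    changes = 0
    for key, value in source.items():
        if key not in target or target[key] != value:
            target[key] = value
            changes += 1
    for key in list(target):
        if key not in source:
            del target[key]
            changes += 1
    return changes


# Hypothetical registry: one synchronization device per supported mode.
SYNC_DEVICES = {"full": full_sync, "incremental": incremental_sync}


def synchronize(mode: str, source: dict, target: dict) -> int:
    """Find the device supporting `mode` and send it the data to be synchronized."""
    device = SYNC_DEVICES.get(mode)
    if device is None:
        raise ValueError(f"no data synchronization device supports mode {mode!r}")
    return device(source, target)
```

Keeping the devices behind a registry means a further synchronization mode can be added without changing the request path, which matches the idea of providing one device per mode that may be used.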
The application also claims a big data text real-time interaction system based on a random algorithm, for implementing the big data text real-time interaction method based on a random algorithm described above. Referring to Figure 4, the system is characterized by comprising:
Local client: stores the local big data text data source and exchanges data with the local text data interaction system platform;
Server end: runs the local data interaction system platform and receives the local big data text data source or a remotely transmitted big data text data source;
Network exchange platform: uses the File Transfer Protocol and builds a webservice to complete the transmission of the big data text data source.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (3)

1. A big data text real-time interaction method based on a random algorithm, characterized by comprising:
Step 1: preprocessing the big text data to be processed using a random algorithm;
Step 2: loading the data source of the big data text and generating big data text real-time interaction field options;
Step 3: based on the generated big data text real-time interaction field options, adjusting the big data text content, completing the big data text real-time interaction, and performing big data text semantic query analysis;
In Step 1, preprocessing the big text data to be processed using a random algorithm includes:
Step 1.1: first performing clustering preprocessing based on the big data text content, establishing a standard comparison data text for text clustering, and completing the text preprocessing operations, including text word segmentation and stop-word removal;
Step 1.2: according to the capacity of the computing resources, initializing a corresponding number of threads N, where N > 1, comprising one main thread and N-1 sub-threads, and then using the random algorithm to divide the text big data, whose size is S, into N data blocks accordingly;
Step 1.3: the i-th sub-thread scans row coordinates starting from position (i-1)*S/N, and the main thread randomly selects and aggregates the sizes of N rows scanned by the sub-threads to estimate the number of rows of the text big data;
Step 1.4: each sub-thread saves the scanned row coordinates into the index data, and the main thread then reads the data sequentially according to the index data;
In Step 2, loading the data source of the big data text and generating the big data text real-time interaction field options includes:
Step 2.1: selecting the data source of the big data text to be loaded, wherein the data source of the big data text includes a local big data text data source and/or a remote cloud big data text data source;
Step 2.2: loading the data source of the big data text: if it is a local big data text data source, locating and reading the local big data text data source via its local storage path; if it is a remote cloud big data text data source, first locating the remote network server of the remote cloud big data text data source, then establishing a webservice network service and transmitting the remote cloud big data text data source;
Step 2.3: obtaining the loaded big data text data source and extracting from it the big data text field items of data loading progress, file size, file line count, current line number, and field count;
For a remote cloud big data text data source, first locating the remote network server of the remote cloud big data text data source, then establishing a webservice network service and transmitting the remote cloud big data text data source, includes:
Step 2.2.1: when transmitting the remote cloud big data text data source, assessing the current operating state, and, when the operating energy-consumption ratio exceeds a first threshold, stopping the remote transmission of the remote cloud big data text data source;
Step 2.2.2: after a period of time, if the operating energy-consumption ratio does not exceed the first threshold, starting the data source breakpoint-resume transfer mechanism;
Step 2.2.3: in response to the user's transmission request for the remote cloud big data text data source, identifying the big data text identification information of the remote cloud big data text data source to be obtained;
Step 2.2.4: based on the obtained big data text identification information of the remote cloud big data text data source, determining whether the remote cloud big data text data source to be sent is a breakpoint-restart data source, a breakpoint-restart data source being one that comprises multiple big data file data source blocks, of which some have been uploaded successfully and the remaining blocks have not completed uploading;
Step 2.2.5: if the remote cloud big data text data source to be sent is not a breakpoint-restart data source, cutting it into a plurality of data source blocks and uploading each data source block of the remote cloud big data text data source to be sent to the server;
In Step 3, adjusting the big data text content based on the generated big data text real-time interaction field options, completing the big data text real-time interaction, and performing big data text semantic query analysis includes:
Step 3.1: the user adjusts the big data text field items shown in the interface, namely data loading progress, file size, file line count, current line number, and field count, taking the running performance of the method into account so that the adjusted operating energy-consumption ratio is not higher than the first threshold;
Step 3.2: standardizing the big data text semantic structure and building semantic normalization models for the query text structures involved in the big data text query analysis process; by setting multi-level semantic description standards, obtaining the semantic description degree of each text structure, the semantic normalization models including the text content normative model TCRM, the text query ordering normative model SCRM, the text query mode normative model SMRM, and the text query business flow normative model FWRM;
Step 3.3: establishing the command parsing and query business flow model, controlling the query process, and feeding back the results;
Before Step 1, preprocessing the big text data to be processed using a random algorithm, the method further includes:
Step 0: setting up data synchronization devices, multiple data synchronization devices being provided according to the data synchronization modes that may be used, including full-update synchronization or incremental-update synchronization;
After Step 3, adjusting the big data text content based on the generated big data text real-time interaction field options, completing the big data text real-time interaction, and performing big data text semantic query analysis, the method further includes:
Step 4: the user's local or remote data source initiates a data synchronization request; for the data to be synchronized, the user's local or remote data source queries, according to the data synchronization mode used, for a data synchronization device that supports that mode and sends the data to be synchronized to the device found;
Step 5: after the user has adjusted the big data text content and performed semantic query analysis on the big data text, the local or remote data synchronization is completed, the local or remote data being updated to match the content resulting from the user's adjustments and queries.
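Steps 1.2 to 1.4 of claim 1 can be illustrated with the following minimal Python sketch. It rests on several assumptions that the claim does not state: the data is held in memory as `bytes` rather than read from a large file, one worker thread is started per data block (the claim counts one main thread plus N-1 sub-threads), a "row coordinate" is taken to be the byte offset at which a line starts, and the line-count estimate samples random byte positions and averages the reciprocal of the enclosing line length. All function names are hypothetical.

```python
import random
import threading


def scan_block(data: bytes, start: int, end: int, out: list) -> None:
    """Record the byte offset of every line that starts in [start, end)."""
    if start == 0 and data:
        out.append(0)
    # A newline at q means a line starts at q + 1; search q in [start-1, end-1).
    pos = data.find(b"\n", max(start - 1, 0), end - 1)
    while pos != -1:
        if pos + 1 < len(data):
            out.append(pos + 1)
        pos = data.find(b"\n", pos + 1, end - 1)


def build_index(data: bytes, n_threads: int) -> list:
    """Split data of size S into N blocks; worker i scans from i*S/N (0-based)."""
    size = len(data)
    results = [[] for _ in range(n_threads)]
    workers = []
    for i in range(n_threads):
        start = i * size // n_threads
        end = (i + 1) * size // n_threads
        worker = threading.Thread(target=scan_block, args=(data, start, end, results[i]))
        workers.append(worker)
        worker.start()
    for worker in workers:
        worker.join()
    # Blocks are contiguous and ordered, so concatenation is already sorted.
    return [offset for block in results for offset in block]


def estimate_line_count(data: bytes, samples: int, rng: random.Random) -> int:
    """Estimate the number of lines from randomly sampled line lengths."""
    size = len(data)
    total = 0.0
    for _ in range(samples):
        pos = rng.randrange(size)
        begin = data.rfind(b"\n", 0, pos) + 1
        stop = data.find(b"\n", pos)
        stop = size if stop == -1 else stop + 1
        # Long lines are hit more often, so average 1/length, not length.
        total += 1.0 / (stop - begin)
    return round(size * total / samples)
```

With the index built, the k-th line can be read directly as `data[index[k]:index[k + 1]]`, while the sampled estimate gives an approximate line count without scanning the whole source.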
2. A big data text real-time interaction device based on a random algorithm, characterized by comprising:
Preprocessing module: preprocesses the big text data to be processed using a random algorithm;
Text data loading module: loads the data source of the big data text and generates big data text real-time interaction field options;
Text data interaction module: based on the generated big data text real-time interaction field options, adjusts the big data text content, completes the big data text real-time interaction, and performs big data text semantic query analysis;
The preprocessing module preprocessing the big text data to be processed using a random algorithm includes:
The preprocessing module first performs clustering preprocessing based on the big data text content, establishes a standard comparison data text for text clustering, and completes the text preprocessing operations, including text word segmentation and stop-word removal;
According to the capacity of the computing resources, a corresponding number of threads N is initialized, where N > 1, comprising one main thread and N-1 sub-threads; the random algorithm is then used to divide the text big data, whose size is S, into N data blocks accordingly;
The i-th sub-thread scans row coordinates starting from position (i-1)*S/N, and the main thread randomly selects and aggregates the sizes of N rows scanned by the sub-threads to estimate the number of rows of the text big data;
Each sub-thread saves the scanned row coordinates into the index data, and the main thread then reads the data sequentially according to the index data;
The text data loading module loading the data source of the big data text and generating the big data text real-time interaction field options includes:
The text data loading module selects the data source of the big data text to be loaded, wherein the data source of the big data text includes a local big data text data source and/or a remote cloud big data text data source;
The text data loading module loads the data source of the big data text: if it is a local big data text data source, the local big data text data source is located and read via its local storage path; if it is a remote cloud big data text data source, the remote network server of the remote cloud big data text data source is located first, then a webservice network service is established and the remote cloud big data text data source is transmitted;
The text data loading module obtains the loaded big data text data source and extracts from it the big data text field items of data loading progress, file size, file line count, current line number, and field count;
For a remote cloud big data text data source, first locating the remote network server of the remote cloud big data text data source, then establishing a webservice network service and transmitting the remote cloud big data text data source, includes:
When transmitting the remote cloud big data text data source, the text data loading module assesses the current operating state, and, when the operating energy-consumption ratio exceeds a first threshold, stops the remote transmission of the remote cloud big data text data source;
After a period of time, if the operating energy-consumption ratio of the device does not exceed the first threshold, the data source breakpoint-resume transfer mechanism is started; in response to the user's transmission request for the remote cloud big data text data source, the big data text identification information of the remote cloud big data text data source to be obtained is identified;
Based on the obtained big data text identification information of the remote cloud big data text data source, it is determined whether the remote cloud big data text data source to be sent is a breakpoint-restart data source, a breakpoint-restart data source being one that comprises multiple big data file data source blocks, of which some have been uploaded successfully and the remaining blocks have not completed uploading;
If the remote cloud big data text data source to be sent is not a breakpoint-restart data source, it is cut into a plurality of data source blocks, and each data source block of the remote cloud big data text data source to be sent is uploaded to the server;
The text data interaction module adjusting the big data text content based on the generated big data text real-time interaction field options, completing the big data text real-time interaction, and performing big data text semantic query analysis includes:
In the text data interaction module, the user adjusts the big data text field items shown in the interface, namely data loading progress, file size, file line count, current line number, and field count, taking the running performance of the device into account so that the adjusted operating energy-consumption ratio is not higher than the first threshold;
The text data interaction module standardizes the big data text semantic structure and builds semantic normalization models for the query text structures involved in the big data text query analysis process; by setting multi-level semantic description standards, it obtains the semantic description degree of each text structure, the semantic normalization models including the text content normative model TCRM, the text query ordering normative model SCRM, the text query mode normative model SMRM, and the text query business flow normative model FWRM; the text data interaction module establishes the instruction parsing and query business flow model, controls the query process, and feeds back the results;
Before the preprocessing module preprocesses the big text data to be processed using a random algorithm, the following is further included:
Data synchronization devices are set up in the preprocessing module, multiple data synchronization devices being provided according to the data synchronization modes that may be used, including full-update synchronization or incremental-update synchronization;
After the text data interaction module adjusts the big data text content based on the generated big data text real-time interaction field options, completes the big data text real-time interaction, and performs big data text semantic query analysis, the following is further included:
The user's local or remote data source initiates a data synchronization request; for the data to be synchronized, the user's local or remote data source queries, according to the data synchronization mode used, for a data synchronization device that supports that mode and sends the data to be synchronized to the device found;
After the user has adjusted the big data text content and performed semantic query analysis on the big data text, the local or remote data synchronization is completed, the local or remote data being updated to match the content resulting from the user's adjustments and queries.
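The text preprocessing operation performed by the preprocessing module, word segmentation followed by stop-word removal, can be sketched as below. This is only an illustrative sketch under stated assumptions: it splits on word characters with a regular expression, which suits space-delimited text but not Chinese text (which would need a dedicated segmenter), and the `STOP_WORDS` set is a small hypothetical list chosen for the example.

```python
import re

# Hypothetical stop-word list; a real deployment would load a fuller one.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})


def preprocess(text: str, stop_words=STOP_WORDS) -> list:
    """Segment text into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]
```

For example, `preprocess("The load of the grid is stable.")` yields `["load", "grid", "stable"]`, the reduced token list that a later clustering step would compare against the standard comparison data text.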
3. A big data text real-time interaction system based on a random algorithm, for implementing the big data text real-time interaction method based on a random algorithm as described in claim 1, characterized by comprising:
Local client: stores the local big data text data source and exchanges data with the local text data interaction system platform;
Server end: runs the local text data interaction system platform and receives the local big data text data source or a remotely transmitted big data text data source;
Network exchange platform: uses the File Transfer Protocol and builds a webservice to complete the transmission of the big data text data source.
CN201710802384.2A 2017-09-07 2017-09-07 A kind of big data text real-time interaction method and device based on random algorithm Expired - Fee Related CN107590125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710802384.2A CN107590125B (en) 2017-09-07 2017-09-07 A kind of big data text real-time interaction method and device based on random algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710802384.2A CN107590125B (en) 2017-09-07 2017-09-07 A kind of big data text real-time interaction method and device based on random algorithm

Publications (2)

Publication Number Publication Date
CN107590125A CN107590125A (en) 2018-01-16
CN107590125B true CN107590125B (en) 2019-12-03

Family

ID=61051230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710802384.2A Expired - Fee Related CN107590125B (en) 2017-09-07 2017-09-07 A kind of big data text real-time interaction method and device based on random algorithm

Country Status (1)

Country Link
CN (1) CN107590125B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831819B (en) * 2019-06-06 2024-07-16 北京嘀嘀无限科技发展有限公司 Text updating method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104182388A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic analysis based text clustering system and method
CN104182489A (en) * 2014-08-11 2014-12-03 同济大学 Query processing method for text big data
CN104239537A (en) * 2014-09-22 2014-12-24 国云科技股份有限公司 Method for realizing generating and processing flow for large-data pre-processing text data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7689557B2 (en) * 2005-06-07 2010-03-30 Madan Pandit System and method of textual information analytics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
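BigInsights: a Hadoop-based data analysis platform; 李士达, et al.; 《https://www.ibm.com/developerworks/cn/data/library/techarticle/dm-1108lisd/》; 20110818; 1-3 *
A big data transmission technique supporting network hard disk storage ***; 周娇, et al.; 《小型微型计算机***》; 20140215 (No. 2); 329-333 *
How to quickly obtain the line count of a large number of text files; wanghonglin1985; 《https://bbs.csdn.net/topics/340087624》; 20100601; 1-2 *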

Also Published As

Publication number Publication date
CN107590125A (en) 2018-01-16

Similar Documents

Publication Publication Date Title
CN107832468B (en) Demand recognition methods and device
CN105224606B (en) A kind of processing method and processing device of user identifier
CN110210624A (en) Execute method, apparatus, equipment and the storage medium of machine-learning process
JP5092165B2 (en) Data construction method and system
CN108021929A (en) Mobile terminal electric business user based on big data, which draws a portrait, to establish and analysis method and system
CN110489578A (en) Image processing method, device and computer equipment
EP3940555A2 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
CN108446964B (en) User recommendation method based on mobile traffic DPI data
CN102930054A (en) Data search method and data search system
CN105760443A (en) Project recommending system, device and method
CN110909182A (en) Multimedia resource searching method and device, computer equipment and storage medium
CN106227792B (en) Method and apparatus for pushed information
CN114172820B (en) Cross-domain SFC dynamic deployment method, device, computer equipment and storage medium
CN109087030A (en) Realize method, General Mobile crowdsourcing server and the system of the crowdsourcing of C2C General Mobile
CN113742488B (en) Embedded knowledge graph completion method and device based on multitask learning
CN107181776A (en) A kind of data processing method and relevant device, system
CN113157947A (en) Knowledge graph construction method, tool, device and server
CN106407377A (en) Search method and device based on artificial intelligence
CN104021125A (en) Search engine sorting method and system and search engine
US20230368028A1 (en) Automated machine learning pre-trained model selector
CN112085087A (en) Method and device for generating business rules, computer equipment and storage medium
CN110020312A (en) The method and apparatus for extracting Web page text
CN107590125B (en) A kind of big data text real-time interaction method and device based on random algorithm
CN106649380A (en) Hot spot recommendation method and system based on tag
CN108717445A (en) A kind of online social platform user interest recommendation method based on historical data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191203

Termination date: 20200907