CN105988996B - Index file generation method and device - Google Patents


Info

Publication number
CN105988996B
Authority
CN
China
Prior art keywords
data content
field
data
rewritten
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510039519.5A
Other languages
Chinese (zh)
Other versions
CN105988996A (en)
Inventor
朱锴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201510039519.5A priority Critical patent/CN105988996B/en
Publication of CN105988996A publication Critical patent/CN105988996A/en
Application granted granted Critical
Publication of CN105988996B publication Critical patent/CN105988996B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an index file generation method and device, wherein the method comprises the following steps: acquiring service data, wherein the service data comprises data content and a service type; acquiring a corresponding configuration file according to the service type, wherein the configuration file comprises a field preprocessing instruction and a word segmentation processing instruction; preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content; and according to the word segmentation processing instruction, performing word segmentation processing on the preprocessed data contents respectively, and performing sequential arrangement processing on the word segmentation processed data contents to generate an index file with a uniform data format. The scheme of the invention can uniformly establish the index file aiming at the data of various service types, thereby simplifying the establishing process and improving the efficiency.

Description

Index file generation method and device
Technical Field
The invention belongs to the technical field of communication, and particularly relates to an index file generation method and device.
Background
With the rapid development of computer and internet technologies, the amount of information stored in network devices is becoming huge, and in order to facilitate querying the information, it is often necessary to assist users in accessing the information by means of establishing index files and the like.
In the prior art, a forward index file is generally generated according to the service type of the data to be retrieved, and the forward index file is then inverted to obtain an inverted index file, so that a user can retrieve data of that service type through the inverted index file. Because factors such as keywords differ between service types, the prior art requires a separate index generation system to be built for the data of each service type in order to generate an index for users to search.
In the research and practice process of the prior art, the inventor of the present invention finds that the index generation system of the existing scheme can only target one service type, so that under the scene of more service types, a plurality of sets of index generation systems need to be built, the establishment of the system has high requirements on the professional level of operators, the whole establishment process is time-consuming, and the efficiency is low.
Disclosure of Invention
The invention aims to provide an index file generation method and an index file generation device, which can be used for establishing an index file aiming at various service data types, simplifying the establishment process and improving the efficiency.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
the first aspect of the present invention provides an index file generation method, where the method includes:
acquiring service data, wherein the service data comprises data content and a service type;
acquiring a corresponding configuration file according to the service type, wherein the configuration file comprises a field preprocessing instruction and a word segmentation processing instruction;
preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content;
performing word segmentation processing on the preprocessed data content respectively according to the word segmentation processing instruction;
and arranging the data content after word segmentation to generate an index file with a uniform data format.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
a second aspect of the present invention provides an index file generation apparatus, wherein the apparatus includes:
the first acquisition module is used for acquiring service data, and the service data comprises data content and a service type;
the second acquisition module is used for acquiring a corresponding configuration file according to the service type, wherein the configuration file comprises a field preprocessing instruction and a word segmentation processing instruction;
the preprocessing module is used for preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content;
the word segmentation module is used for performing word segmentation processing on the preprocessed data content according to the word segmentation processing instruction;
and the index generation module is used for carrying out sequential processing on the data contents after the word segmentation processing to generate an index file with a uniform data format.
Compared with the prior art, in this embodiment, according to the service types of multiple service data, the corresponding configuration file is obtained, then, according to the field preprocessing instruction of the configuration file, the data content of the service data is preprocessed, and according to the word segmentation processing instruction of the configuration file, the preprocessed data content is respectively subjected to word segmentation processing, so that the index file with the unified data format is generated. The invention processes the data by adopting the corresponding configuration files aiming at the data of different service types, then adopts the same program to segment the data content, and normalizes the service data of different formats into the index data of a uniform data format, thereby being capable of uniformly establishing the index file aiming at various service data types, simplifying the establishing process and improving the efficiency.
Drawings
The technical solution and other advantages of the present invention will become apparent from the following detailed description of specific embodiments of the present invention, which is to be read in connection with the accompanying drawings.
Fig. 1 is a schematic flowchart of an index file generation method according to a first embodiment of the present invention;
FIG. 2a is a flowchart illustrating a method for generating an index file according to a second embodiment of the present invention;
FIG. 2b and FIG. 2c are schematic diagrams of configuration interfaces of fields of an index file generation method provided by the present invention;
fig. 3a and fig. 3b are schematic flow diagrams of an index file generation method according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an index file generating apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of an index file generating apparatus according to a fifth embodiment of the present invention;
fig. 6 is a schematic structural diagram of a server according to a sixth embodiment of the present invention.
Detailed Description
Referring to the drawings, wherein like reference numbers refer to like elements, the principles of the present invention are illustrated as being implemented in a suitable computing environment. The following description is based on illustrated embodiments of the invention and should not be taken as limiting the invention with regard to other embodiments that are not detailed herein.
In the description that follows, specific embodiments of the present invention are described with reference to steps and symbolic representations of operations performed by one or more computers, unless otherwise indicated. Accordingly, these steps and operations are at times referred to as being computer-executed, which includes the manipulation by the computer's processing unit of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the data format. However, while the principles of the invention are described in the foregoing terms, this is not meant to be limiting, and those skilled in the art will appreciate that the various steps and operations described hereinafter may also be implemented in hardware.
The principles of the present invention are operational with numerous other general purpose or special purpose computing, communication environments or configurations. Examples of well known computing systems, environments, and configurations that may be suitable for use with the invention include, but are not limited to, hand-held telephones, personal computers, servers, multiprocessor systems, microcomputer-based systems, mainframe-based computers, and distributed computing environments that include any of the above systems or devices.
The term "module" as used herein may be considered a software object executing on the computing system. The various components, modules, engines, and services described herein may be viewed as objects implemented on the computing system. The apparatus and method described herein are preferably implemented in software, but may also be implemented in hardware, and are within the scope of the present invention.
Also as used herein, the word "preferred" is intended to serve as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied in any of the foregoing instances.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes," "has," "contains," or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising."
First embodiment
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating an index file generating method according to a first embodiment of the present invention. The method comprises the following steps:
in step S101, service data is obtained, where the service data includes data content and a service type.
The index file generation method is based on a B/S (browser/server) architecture: a user accesses the system through a browser, and the system supports generating index data in a uniform data format from data of multiple service types on the same platform.
In this embodiment, the service type may include: video, music, picture, etc., and correspondingly, the service data may include video data, music data, picture data, etc., and is not limited in particular here.
It can be understood that the data format of the service data in this embodiment may be divided into two parts, where one part carries information indicating the service type, and the other part carries data content corresponding to the service type.
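The two-part layout described above can be pictured with a small data structure. The following is only a minimal sketch: the class name `ServiceData`, the key names `service_type` and `data_content`, and the representation of data content as a list of field dictionaries are all assumptions made for illustration, not details given in this document.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ServiceData:
    """Service data split into its two parts (names are hypothetical)."""
    service_type: str                   # e.g. "video", "music", "picture"
    data_content: List[Dict[str, Any]]  # assumed: one dict of fields per document


def parse_service_data(raw: Dict[str, Any]) -> ServiceData:
    """Separate the service-type part of a payload from its data-content part."""
    if "service_type" not in raw or "data_content" not in raw:
        raise ValueError("payload must carry both a service type and data content")
    return ServiceData(raw["service_type"], list(raw["data_content"]))
```

Keeping the service type apart from the content is what lets a later step choose a configuration file without inspecting the content itself.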
In step S102, a corresponding configuration file is obtained according to the service type, where the configuration file includes a field preprocessing instruction and a word segmentation processing instruction.
It is understood that each service type corresponds to a configuration file, wherein the configuration file is pre-configured and stored in the server by the user according to the characteristics of the service type in actual operation.
The configuration file includes an instruction to preprocess the field of the data content and an instruction to perform word segmentation on the field of the data content, and the configuration file is generated according to the configuration of the field of each service data by the user, where the configuration of the field is not specifically limited.
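One way to picture step S102 is a lookup keyed by service type. The sketch below assumes, purely for illustration, that each configuration file is stored as JSON under a hypothetical `configs` directory and that its two instruction sections are named `field_preprocessing` and `word_segmentation`; the document does not specify a storage format or key names.

```python
import json
from pathlib import Path

CONFIG_DIR = Path("configs")  # hypothetical storage location on the server


def load_config(service_type: str, config_dir: Path = CONFIG_DIR) -> dict:
    """Return the pre-stored configuration for one service type."""
    path = config_dir / f"{service_type}.json"
    if not path.is_file():
        raise KeyError(f"no configuration file for service type {service_type!r}")
    with path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    # Both instruction sections are expected in every configuration file.
    for key in ("field_preprocessing", "word_segmentation"):
        if key not in config:
            raise ValueError(f"configuration for {service_type!r} lacks {key!r}")
    return config
```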
In step S103, the data content is preprocessed according to the field preprocessing instruction, so as to generate a preprocessed data content.
In step S104, performing word segmentation on the preprocessed data content according to the word segmentation processing instruction;
in step S105, the data content after the word segmentation is arranged in order to generate an index file with a uniform data format.
It is understood that the steps S103 to S105 may specifically include:
Each service type corresponds to a configuration file, and therefore to its own field preprocessing instruction. The data content of each service type is preprocessed according to the corresponding field preprocessing instruction, which allows the individual differences between service types to be accommodated. After preprocessing, service data in different formats can be sent to a forward-index processing unit (FSU) for forward index generation, yielding normalized forward-index data suited to data retrieval across multiple service types.
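Steps S102 to S105 can be sketched end to end as follows. This is an illustrative outline only: the in-memory `CONFIGS` table, the whitespace tokenizer standing in for a real word segmenter, and the record layout (`doc_id` plus `terms`) are assumptions, not the format used by the described system.

```python
from typing import Dict, List

Document = Dict[str, str]

# Hypothetical per-service-type configurations: preprocessing steps differ,
# while the segmentation and index-building code below is shared.
CONFIGS = {
    "music": {
        "preprocess": [lambda doc: {**doc, "title": doc["title"].strip()}],
        "segment_fields": ["title"],
    },
    "video": {
        "preprocess": [lambda doc: {**doc, "name": doc["name"].lower()}],
        "segment_fields": ["name"],
    },
}


def segment(text: str) -> List[str]:
    """Stand-in tokenizer (whitespace split); a real segmenter is assumed."""
    return text.split()


def build_forward_index(service_type: str, documents: List[Document]) -> List[dict]:
    config = CONFIGS[service_type]                # S102: pick config by type
    records = []
    for doc in documents:
        for step in config["preprocess"]:         # S103: type-specific preprocessing
            doc = step(doc)
        terms: List[str] = []
        for field in config["segment_fields"]:    # S104: shared word segmentation
            terms.extend(segment(doc[field]))
        records.append({"doc_id": doc["doc_id"], "terms": terms})  # S105: uniform record
    return records
```

Although the two service types use different field names and preprocessing, both produce records of the same shape, which is the point of the normalization.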
As can be seen from the above, in this embodiment, according to the service types of multiple service data, the corresponding configuration file is obtained, then, according to the field preprocessing instruction of the configuration file, the data content of the service data is preprocessed, and according to the word segmentation processing instruction of the configuration file, the preprocessed data content is respectively subjected to word segmentation processing, so as to generate the index file with the unified data format. The invention processes the data by adopting the corresponding configuration files aiming at the data of different service types, then adopts the same program to segment the data content, and normalizes the service data of different formats into the index data of a uniform data format, thereby being capable of uniformly establishing the index file aiming at various service data types, simplifying the establishing process and improving the efficiency.
Second embodiment
Referring to fig. 2a, fig. 2a is a flowchart illustrating an index file generation method according to a second embodiment of the present invention. The index file generation method provided by the invention is based on a B/S (browser/server) architecture: a user accesses the system through a browser, and the system supports generating index data in a uniform data format from data of multiple service types on the same platform.
In the embodiment of the present invention, mainly for the generation of configuration files, that is, the configuration of attribute values of fields, the method includes the steps of:
in step S201, profiles corresponding to different service types are generated, respectively.
It can be understood that each service type corresponds to a profile, wherein the service types may include: video, music, pictures, and the like, and correspondingly, the service data includes video data, music data, and picture data.
In this embodiment, the configuration file is pre-configured and stored in the server by the user according to the characteristics of the service type in the actual operation, and the configuration file includes an instruction for preprocessing the field of the data content and an instruction for segmenting the field of the data content.
In a preferred embodiment, the configuration file may be obtained based on the following steps:
step (1), acquiring field configuration information corresponding to the service type;
the field configuration information indicates preset attribute values of a plurality of fields, and the fields comprise text fields, numerical fields and classification fields;
it can be understood that the service data of the present invention includes data content and service type, the data content includes a plurality of documents, the documents are composed of a plurality of fields, wherein the type of the field can be preset, and the field includes a text field, a numerical field and a classification field.
Further, a text field is a field of pure text information, such as "I like this singer" or "this song is really good". A numerical field is a field whose content is a number, or words representing a number, such as "1" and "5" or "one" and "five". A classification field is a field indicating how the data is classified; for example, a song may be classified as "rock and roll", "jazz", etc., and a video may be classified as "movie", "art", "news", etc.
In addition, each field comprises at least one attribute, which can also be called a configuration item, and the value of the attribute is shown in a selection box form for the user to select and configure.
And (2) configuring the attribute values of the fields according to the indication of the configuration information of the fields to obtain a configuration file corresponding to the service type.
And obtaining a configuration file corresponding to the service type according to the configuration of the attribute value of the attribute of each type of field by the user.
Based on this, in a further preferred embodiment, the attribute values of the plurality of fields may be configured based on the following specific manner, that is, step (2) may specifically include:
and (21) configuring the attribute value of the attribute of the text field according to the indication of the configuration information of the field to obtain the configured text field.
In this embodiment, the text field is mainly intended for fields that contain text information and that the user wants to be searchable; the attributes of the text field may include one or a combination of two or more of description, data length, primary key, importance, and word segmentation mode.
Referring to fig. 2b and fig. 2c together, fig. 2b is an illustration of an attribute configuration interface of a field, fig. 2c is an illustration of a user-defined field management interface, and the meaning of each attribute of the text field is briefly described as follows:
a. Description: the meaning that the field represents. It serves as a prompt, and this attribute has no influence on the search result.
b. Data length: the maximum length of the text of the field. Currently, fields are divided into two tiers according to whether the maximum length exceeds 256 bytes; a field whose maximum length exceeds 256 bytes is called a long text field. Only one of the text fields can be configured as a long text field.
c. Primary key: the field used to uniquely identify a document, called doc_id. The field is set to a value that can be converted into a number; specifically, the value of doc_id is a 64-bit integer. Since the value should be uniformly distributed over the uint64_t space, it is preferably generated by a method such as hashing. The Hash value is computed from the data content, different documents have different Hash values, and the Hash value thus serves as the identity card of each document.
d. Importance: indicates the importance of the text field, which can be classified as important, general, or unimportant.
e. Word segmentation mode: either normal segmentation or prefix segmentation. Normal segmentation segments the text according to natural semantics and can be selected by default in general cases; prefix segmentation suits the scenario in which a search box offers suggestions in a drop-down box. For example, "internal search platform section" may be segmented into "internal", "internal search", and so on, so that "internal search platform section" can be suggested when the user enters "internal" in the search box.
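The prefix segmentation mode in item e can be illustrated with a short sketch. The granularity used here (word-level prefixes of a space-separated phrase) is an assumption chosen to match the translated example; the original system may well work at character level. The function names `prefix_segments` and `suggest` are hypothetical.

```python
from typing import List


def prefix_segments(text: str) -> List[str]:
    """Return every leading word-level prefix of `text`, shortest first."""
    words = text.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]


def suggest(query: str, indexed_texts: List[str]) -> List[str]:
    """Return the indexed texts that have `query` as one of their prefixes."""
    return [t for t in indexed_texts if query in prefix_segments(t)]
```

With this scheme, indexing "internal search platform section" stores the terms "internal", "internal search", and so on, so a query of "internal" matches the full phrase while an unrelated phrase does not.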
It can be understood that the word segmentation processing instruction can be obtained according to the configuration of the word segmentation mode, so that word segmentation processing is performed on the data content according to the word segmentation processing instruction.
And (22) configuring the attribute value of the attribute of the numerical value field according to the indication of the configuration information of the field to obtain the configured numerical value field.
In this embodiment, the attribute of the numerical field includes one or a combination of two or more of description, data type, authority, importance, and primary key.
The numerical field is applicable to information of a numerical type, such as selling price or download count. The string value in this field must be convertible to a number. The meaning of each attribute of the numerical field is briefly described below:
a. Description: the meaning that the field represents. It serves as a prompt, and this attribute has no influence on the search result.
b. Data type: in this embodiment the selectable configuration items are int8, uint8, int16, uint16, int32, uint32, int64, uint64 and float. The user should choose a type according to the largest range the value may take; if an actual value in the supplied data exceeds the configured range, an error occurs.
c. Authoritative: indicates that the field can reflect the authority of the document. For example, for video search, the number of views may be selected as the authoritative field. At most one numerical field (that is, zero or one) may be designated as the authoritative field.
d. Importance: indicates the importance of the field, which can be classified as important, general, or unimportant.
e. Primary key: defined in the same way as for text fields, that is, the field used to uniquely identify a document, called doc_id. The field is set to a value that can be converted into a number; specifically, the value of doc_id is a 64-bit integer. Since the value should be uniformly distributed over the uint64_t space, it is preferably generated by a method such as hashing.
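Two of the numerical-field rules above lend themselves to a short sketch: deriving a 64-bit doc_id from a hash, and rejecting values outside the configured data type's range. The choice of SHA-256 truncated to 8 bytes is an assumption made for illustration (the document only says a Hash value is preferred), and the helper names `make_doc_id` and `parse_numeric` are hypothetical.

```python
import hashlib

# Inclusive value ranges for the selectable integer types.
INT_RANGES = {
    "int8": (-2**7, 2**7 - 1), "uint8": (0, 2**8 - 1),
    "int16": (-2**15, 2**15 - 1), "uint16": (0, 2**16 - 1),
    "int32": (-2**31, 2**31 - 1), "uint32": (0, 2**32 - 1),
    "int64": (-2**63, 2**63 - 1), "uint64": (0, 2**64 - 1),
}


def make_doc_id(unique_key: str) -> int:
    """Map a unique key to an unsigned 64-bit integer via a hash (assumed: SHA-256)."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # first 8 bytes fit the uint64 range


def parse_numeric(raw: str, data_type: str):
    """Convert a field's string value, failing if it exceeds the configured range."""
    if data_type == "float":
        return float(raw)
    low, high = INT_RANGES[data_type]
    value = int(raw)
    if not low <= value <= high:
        raise ValueError(f"{value} is out of range for {data_type}")
    return value
```

A truncated hash spreads identifiers roughly evenly over the 64-bit space; collisions between distinct keys are extremely unlikely but not impossible, so a production system would need its own policy for that case.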
Step (23), configuring the attributes of the classified domain fields according to the indication of the configuration information of the fields to obtain the configured classified domain fields;
in this embodiment, the attributes of the classification domain field include a retrieval specification category;
and (24) generating a configuration file corresponding to the service type according to the configured text field, the configured numerical value field and the configured classification field.
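Steps (21) to (24) can be sketched as assembling a configuration from three lists of configured fields. The class names, default values, and dictionary layout below are assumptions for illustration; only the attribute lists and the two constraints checked (at most one long text field, at most one authoritative numerical field) come from the description above.

```python
from dataclasses import asdict, dataclass
from typing import List

LONG_TEXT_THRESHOLD = 256  # bytes; longer text fields count as long text fields


@dataclass
class TextField:
    name: str
    description: str = ""
    max_length: int = 256
    primary_key: bool = False
    importance: str = "general"   # "important", "general" or "unimportant"
    segmentation: str = "normal"  # "normal" or "prefix"


@dataclass
class NumericField:
    name: str
    description: str = ""
    data_type: str = "int32"
    authoritative: bool = False
    importance: str = "general"
    primary_key: bool = False


@dataclass
class ClassificationField:
    name: str


def build_config(service_type: str,
                 text_fields: List[TextField],
                 numeric_fields: List[NumericField],
                 class_fields: List[ClassificationField]) -> dict:
    """Combine the configured fields into one configuration for a service type."""
    long_text = [f.name for f in text_fields if f.max_length > LONG_TEXT_THRESHOLD]
    if len(long_text) > 1:
        raise ValueError(f"only one long text field is allowed, got {long_text}")
    authoritative = [f.name for f in numeric_fields if f.authoritative]
    if len(authoritative) > 1:
        raise ValueError(f"only one authoritative field is allowed, got {authoritative}")
    return {
        "service_type": service_type,
        "text_fields": [asdict(f) for f in text_fields],
        "numeric_fields": [asdict(f) for f in numeric_fields],
        "classification_fields": [asdict(f) for f in class_fields],
    }
```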
In step S202, service data is acquired.
Wherein the service data comprises data content and service type; the service types may include: video, music, picture, etc., and correspondingly, the service data may include video data, music data, picture data, etc., and is not limited in particular here.
It can be understood that after configuration files corresponding to different service types are generated, the configuration files are preset in the server, and after service data of user data are acquired, the server is triggered to call out the configuration files corresponding to the service types from a plurality of preset configuration files according to the service types, so that processing is performed according to the configuration files, and an index file is generated.
In step S203, a corresponding configuration file is obtained according to the service type.
It is understood that each service type corresponds to a configuration file, wherein the configuration file is generated by the user in advance according to the configuration information in step S201 and is stored in the server.
In step S204, the data content is preprocessed according to the field preprocessing instruction, so as to generate a preprocessed data content.
Each service type corresponds to a configuration file, namely each service type corresponds to a field preprocessing instruction, and each service type respectively preprocesses the data content according to the corresponding field preprocessing instruction, for example, certain fields of service push data are rewritten, data is cleaned, data tags are supplemented, and the like, so that personalized difference operation among different service types can be embodied.
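The kinds of preprocessing mentioned above (rewriting fields, cleaning data, supplementing tags) can be sketched as small composable steps selected per service type. Everything named here, including `PREPROCESSORS`, the step functions, and the example field names, is hypothetical and only illustrates how per-type differences might be expressed.

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, str]
Step = Callable[[Document], Document]


def clean_whitespace(doc: Document) -> Document:
    """Data cleaning: collapse runs of whitespace and trim each value."""
    return {k: re.sub(r"\s+", " ", v).strip() for k, v in doc.items()}


def rewrite_field(name: str, fn: Callable[[str], str]) -> Step:
    """Field rewriting: replace one field's value with fn(value) if present."""
    def step(doc: Document) -> Document:
        out = dict(doc)
        if name in out:
            out[name] = fn(out[name])
        return out
    return step


def add_tag(tag_field: str, tag: str) -> Step:
    """Tag supplementing: add a tag field unless the document already has one."""
    def step(doc: Document) -> Document:
        out = dict(doc)
        out.setdefault(tag_field, tag)
        return out
    return step


# Hypothetical per-service-type preprocessing instructions.
PREPROCESSORS: Dict[str, List[Step]] = {
    "music": [clean_whitespace, add_tag("category", "music")],
    "video": [clean_whitespace, rewrite_field("title", str.title)],
}


def preprocess(service_type: str, doc: Document) -> Document:
    for step in PREPROCESSORS[service_type]:
        doc = step(doc)
    return doc
```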
In step S205, the preprocessed data content is analyzed to determine attribute information of the data content.
In some embodiments, a preset word segmentation template may be obtained, and the preprocessed data content may be analyzed according to the word segmentation template to determine attribute information of the data content. The server is preset with a plurality of word segmentation templates, which can comprise data templates for a plurality of service types. For music data, for example, the data templates can comprise a singer database, a song name database, a genre database and the like, and the attribute information of the data content can be obtained by analyzing the data content against these data templates; for example, if the data content belongs to the music service type, the attribute information refers to attributes of a numerical type such as the download count and play count of a song.
In step S206, word segmentation is performed on the preprocessed service data according to the word segmentation processing instruction and the attribute information, and the segmented data content is arranged into a forward index, so as to generate a forward index file with a uniform data format.
After preprocessing, the data can be processed according to the attribute information and the preset word segmentation processing instruction; that is, a normalization operation is performed in which service data in different formats are normalized into a unified data format, yielding normalized forward-index data suited to data retrieval across multiple service types.
It is understood that after preprocessing, the data enters the forward-index processing unit (FSU) for forward index generation. Based on the word segmentation processing instruction configured in the configuration file and on several built-in word segmentation templates, the FSU computes the data needed for search, such as the segmented words and word POS information, and finally outputs the forward index file in a uniform format.
It is to be understood that, after generating the forward index file in the unified data format, the method may further include:
in step S207, the forward index file is converted into an inverted index file, so that the user can search according to the inverted index file.
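The conversion in step S207 can be sketched as follows, assuming for illustration that the forward index is held in memory as a mapping from document identifier to its list of segmented terms. The on-disk file formats are not specified in this document, and the function name `invert` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def invert(forward_index: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Turn doc_id -> terms into term -> sorted list of doc_ids."""
    inverted = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            inverted[term].add(doc_id)  # a set avoids duplicate postings
    return {term: sorted(ids) for term, ids in inverted.items()}
```

A search for a term then becomes a single lookup in the inverted mapping rather than a scan over every document.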
As can be seen from the above, in this embodiment, according to the service types of multiple service data, the corresponding configuration file is obtained, then, according to the field preprocessing instruction of the configuration file, the data content of the service data is preprocessed, and according to the word segmentation processing instruction of the configuration file, the preprocessed data content is respectively subjected to word segmentation processing, so as to generate the index file with the unified data format. The invention processes the data by adopting the corresponding configuration files aiming at the data of different service types, then adopts the same program to segment the data content, and normalizes the service data of different formats into the index data of a uniform data format, thereby being capable of uniformly establishing the index file aiming at various service data types, simplifying the establishing process and improving the efficiency.
Third embodiment
Please refer to fig. 3a and fig. 3b, which are schematic flow charts of an index file generation method according to a third embodiment of the present invention. The index file generation method provided by the invention is based on a B/S (browser/server) architecture: a user accesses the system through a browser, and the system supports generating index data in a uniform data format from data of multiple service types on the same platform.
In the embodiment of the present invention, the analysis is mainly performed in a process of preprocessing data content, and the method includes:
in step S301, service data is acquired.
Wherein the service data comprises data content and service type; the service types may include: video, music, picture, etc., and correspondingly, the service data may include video data, music data, picture data, etc., and is not limited in particular here.
It can be understood that configuration files corresponding to different service types are generated in advance and preset in the server. After the service data is acquired, the server is triggered to select, according to the service type, the corresponding configuration file from the plurality of preset configuration files, so that processing is performed according to that configuration file and an index file is generated.
In step S302, a corresponding configuration file is obtained according to the service type.
It can be understood that each service type corresponds to a configuration file, wherein the configuration file includes a field preprocessing instruction and a word segmentation processing instruction, and the configuration file is pre-configured by a user according to the characteristics of the service type in actual operation and stored in the server.
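As a non-limiting illustration, the selection of a preset configuration file by service type might be sketched as follows. The names `ConfigFile`, `PRESET_CONFIGS` and `get_config`, the dictionary layout of the service data, and the example field names are all assumptions made for this sketch; the embodiment does not prescribe any particular data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigFile:
    """Hypothetical in-memory form of one per-service-type configuration file."""
    service_type: str
    field_preprocessing: tuple  # names of preprocessing steps, e.g. ("clean", "rewrite")
    word_segmentation: dict     # field name -> word segmentation mode


# Assumed registry of configuration files preset in the server, keyed by service type.
PRESET_CONFIGS = {
    "music": ConfigFile("music", ("clean", "rewrite"), {"song_name": "normal"}),
    "video": ConfigFile("video", ("clean",), {"title": "prefix"}),
}


def get_config(service_data: dict) -> ConfigFile:
    """Select the preset configuration file that matches the service type."""
    service_type = service_data["service_type"]
    try:
        return PRESET_CONFIGS[service_type]
    except KeyError:
        raise LookupError(
            f"no configuration file preset for service type {service_type!r}"
        ) from None
```

In this sketch a service type without a preset configuration file is reported as an error rather than processed with a default, which is one possible design choice.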
Preferably, before acquiring the service data (i.e. step S301), the method may further include generating configuration files corresponding to different service types. Specifically, the field configuration information corresponding to the service type may be obtained first, and then the attribute values of the fields are configured according to the indication of the field configuration information, so as to obtain the configuration file corresponding to the service type.
In the embodiment of the present invention, the fields may include a text field, a numerical field, and a classification field. Each field has corresponding attributes, which are configured according to the indication of the field configuration information so as to generate the configuration file. The generation of configuration files corresponding to different service types can be implemented with reference to the description of step S201 in the above embodiment, and is not described herein again.
It can be understood that the server may include a dynamic library for index data preprocessing, and the data content may be preprocessed according to a field preprocessing instruction in the configuration file after the configuration file is obtained, so as to generate the preprocessed data content.
In this embodiment, the preprocessing of the data content mainly includes data cleaning and data rewriting, and the execution order of the two is not limited: data cleaning may be performed first and then data rewriting, or data rewriting may be performed first and then data cleaning, or both may be performed simultaneously. The two do not affect each other, and this example does not constitute a limitation to the present invention.
In one possible embodiment, after the configuration file is obtained, step S303A may be executed:
referring to fig. 3a, in step S303A, according to the field preprocessing instruction in the configuration file, data is first cleaned and then rewritten; the step S303A may specifically include:
step A, judging whether a junk field exists in the data content;
according to the judgment result, executing the step A1 or the step A2;
step A1, if a junk field exists, deleting the junk field from the data content, and judging whether the deleted data content needs to be rewritten;
executing the step A11 or the step A12 according to the judgment result of the step A1;
step A11, if rewriting is needed, rewriting the deleted data content, and using the rewritten data content as the preprocessed data content;
step A12, if the data content need not be rewritten, the deleted data content is used as the preprocessed data content;
step A2, if there is no junk field, judging whether the data content needs to be rewritten;
executing the step A21 or the step A22 according to the judgment result of the step A2;
step A21, if rewriting is needed, rewriting the data content, and using the rewritten data content as the preprocessed data content;
step A22, if rewriting is not needed, using the data content as the preprocessed data content.
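The decision flow of steps A to A22 can be sketched as below. This is only an illustrative sketch: the function name `preprocess_clean_first` and the four callables passed to it are hypothetical, and how junk is detected or how rewriting is done is left to the field preprocessing instruction of the configuration file.

```python
from typing import Callable


def preprocess_clean_first(
    content: str,
    has_junk: Callable[[str], bool],
    remove_junk: Callable[[str], str],
    needs_rewrite: Callable[[str], bool],
    rewrite: Callable[[str], str],
) -> str:
    """Clean first, then rewrite, following steps A, A1/A2, A11/A12, A21/A22."""
    # Step A: judge whether a junk field exists in the data content.
    if has_junk(content):
        # Step A1: delete the junk field from the data content.
        content = remove_junk(content)
    # Steps A1/A2: judge whether the (possibly cleaned) content needs rewriting.
    if needs_rewrite(content):
        # Steps A11/A21: the rewritten content is the preprocessed data content.
        return rewrite(content)
    # Steps A12/A22: the content is used as the preprocessed data content as-is.
    return content


# Usage with purely illustrative rules: "!" counts as junk, and content with
# surrounding whitespace is rewritten by stripping it.
result = preprocess_clean_first(
    "  hello! ",
    has_junk=lambda s: "!" in s,
    remove_junk=lambda s: s.replace("!", ""),
    needs_rewrite=lambda s: s != s.strip(),
    rewrite=str.strip,
)
```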
In another possible implementation, after the configuration file is obtained, step S303B may be executed:
referring to fig. 3b, in step S303B, according to the field preprocessing instruction in the configuration file, data is first rewritten and then cleaned; the step S303B may specifically include:
step B, judging whether the data content needs to be rewritten;
according to the judgment result, executing the step B1 or the step B2;
step B1, if rewriting is needed, rewriting the data content, and judging whether a junk field exists in the rewritten data content;
executing the step B11 or the step B12 according to the judgment result of the step B1;
step B11, if a junk field exists, deleting the junk field from the rewritten data content, and taking the deleted data content as the preprocessed data content;
step B12, if no junk field exists, taking the rewritten data content as the preprocessed data content;
step B2, if rewriting is not needed, judging whether a junk field exists in the data content;
executing the step B21 or the step B22 according to the judgment result of the step B2;
step B21, if a junk field exists, deleting the junk field from the data content, and taking the deleted data content as the preprocessed data content;
step B22, if no junk field exists, taking the data content as the preprocessed data content.
Further, as shown in step S303A and step S303B, data cleaning removes junk fields, such as punctuation marks, from the data content; such fields may degrade the subsequent retrieval experience and therefore should be removed. Data rewriting is needed because some data requires special processing before index data is generated, for example separating a mixed Chinese and English name in a field into two names.
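For illustration only, the two preprocessing operations named above, removing punctuation and splitting a mixed Chinese and English name, might look like the following sketch. The helper names, the regular expressions, and the sample name are assumptions for this example; an actual dynamic library may use entirely different rules.

```python
import re
import unicodedata

# Assumed character classes for the sketch: common CJK ideographs and ASCII letters.
_CJK = re.compile(r"[\u4e00-\u9fff]+")
_LATIN = re.compile(r"[A-Za-z][A-Za-z ]*")


def remove_punctuation(text: str) -> str:
    """Data cleaning: drop every character whose Unicode category is punctuation."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def split_mixed_name(name: str) -> list:
    """Data rewriting: separate a mixed Chinese/English name into two names."""
    chinese = "".join(_CJK.findall(name))
    english = " ".join(part.strip() for part in _LATIN.findall(name)).strip()
    return [part for part in (chinese, english) if part]


def preprocess_name(name: str, clean_first: bool = True) -> list:
    """Apply cleaning and rewriting in either order (as in step S303A or S303B)."""
    if clean_first:
        return split_mixed_name(remove_punctuation(name))
    return [remove_punctuation(part) for part in split_mixed_name(name)]


names = preprocess_name("周杰伦 (Jay Chou)")
```

For this sample input both orders yield the same two names; for other inputs (for example names containing hyphens) the result may depend on the rules chosen.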
Preferably, the server may further include a dynamic library for preprocessing original data. This dynamic library processes the original service data, and the processed data serves as the data input of the preprocessing operation described above. The processing mainly includes data expansion, format check and the like. Data expansion addresses the case where the data pushed by some services is not comprehensive enough to meet all search requirements of users: the data of such services is supplemented by capturing other resources on the internet, for example, for video and music search, by supplementing a large amount of data on resources that the system itself does not own. Format check refers to checking the correctness of the data pushed by the service, that is, checking whether data types or fields inconsistent with the configuration have been pushed.
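A minimal sketch of such a format check is given below, assuming that a pushed record is a dictionary and that the configuration maps each field name to either "text" or "numeric". The function name `check_format` and these conventions are hypothetical and only serve to illustrate the idea of rejecting fields and data types that are inconsistent with the configuration.

```python
def check_format(record: dict, configured_fields: dict) -> list:
    """Return a list of problems; an empty list means the record matches the configuration."""
    problems = []
    for name, value in record.items():
        kind = configured_fields.get(name)
        if kind is None:
            # A field that the configuration does not know about was pushed.
            problems.append(f"field {name!r} is not in the configuration")
        elif kind == "numeric":
            # Numerical fields must hold a value that can be converted into a number.
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append(f"field {name!r} is not convertible to a number: {value!r}")
        elif kind == "text" and not isinstance(value, str):
            problems.append(f"field {name!r} should be text, got {type(value).__name__}")
    return problems


configured = {"title": "text", "plays": "numeric"}
ok = check_format({"title": "some song", "plays": "12"}, configured)
bad = check_format({"title": "some song", "plays": "many", "extra": 1}, configured)
```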
In step S304, analyzing the preprocessed data content to determine attribute information of the data content;
in some embodiments, a preset word segmentation template may be obtained, and the preprocessed data content may be analyzed according to the word segmentation template to determine attribute information of the data content. The server is preset with a plurality of word segmentation templates, which may include data templates for a plurality of service types. For music data, for example, the data templates may include a singer database, a song name database, a genre database and the like, and the attribute information of the data content can be obtained by analyzing the data content against these data templates. For example, if the data content belongs to the music service type, the attribute information may refer to numerical attributes such as the download count and play count of a song.
In step S305, according to the word segmentation processing instruction and the attribute information, word segmentation is performed on the preprocessed data content, and forward-arrangement processing is performed on the word-segmented data content, so as to generate a forward index file with a uniform data format.
After the preprocessing, processing is carried out according to the attribute information and the preset word segmentation processing instruction; that is, a normalization operation is performed in which the service data in different formats is normalized into a unified data format, yielding normalized forward-index data suited to data retrieval for various service types.
It is understood that after preprocessing, the data enters the forward-index processing unit FSU for forward index generation. By configuring the word segmentation processing instruction in the configuration file and processing the data according to several built-in word segmentation templates, the information needed for retrieval, such as word information and word part-of-speech (POS) information, is calculated, and the forward index file with a uniform format is finally output.
It is to be understood that, after generating the forward index file in the unified data format, the method may further include:
in step S306, the forward index file is converted into an inverted index file, so that the user can search according to the inverted index file.
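The conversion of a forward index into an inverted index can be illustrated with the following sketch. It assumes, for the purpose of the example only, that the forward index maps each document identifier to its list of segmented words; the actual file formats are not specified in this embodiment.

```python
from collections import defaultdict


def invert(forward_index: dict) -> dict:
    """Convert {doc_id: [words]} into {word: sorted list of doc_ids}."""
    postings = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            postings[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in postings.items()}


forward = {1: ["today", "weather"], 2: ["weather", "good"]}
inverted = invert(forward)
```

A search for a word then becomes a single lookup in the inverted structure instead of a scan over every document.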
In combination with the above, the index file generation method is briefly analyzed below in an application scenario:
it is understood that the generation method is based on a B/S (browser/server) system structure, and the system supports generating index data in a uniform data format from data of multiple service types on the same platform. First, the platform provides page-based configuration: after service data is accessed, the platform needs to be informed of which data fields exist in the current service data, as well as the type and attribute values of each field, and the like.
For example, a novel search service has six fields, of which four are used as text fields to establish indexes and two are used as numerical fields to provide relevance scoring. For each field selected for indexing, semantic word segmentation is performed, word information is calculated, and the inverted index is finally established; these are the fields that users can search.
The word segmentation mode specifies how the words in each field are segmented when the text index is established. Commonly used modes include normal segmentation, prefix segmentation, category-index segmentation, etc.
Normal segmentation performs ordinary semantic word segmentation on the text; for example, the sentence "the weather is really good today" is divided into four words: today / weather / really / good. Prefix segmentation divides the same sentence into six terms, each a progressively longer prefix of the sentence; this mode is mainly used for the function of prompting suggested queries. Category-index segmentation is a more advanced usage for text that carries categories: for example, novels are divided into categories such as martial arts, romance and science fiction, and after the index is established with category-index segmentation, the service can query by novel category, such as a novel named "Three-Body" in the science fiction category.
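The three modes can be contrasted with a small sketch. Everything here is an assumption for illustration: whitespace splitting merely stands in for real semantic segmentation, prefixes are taken character by character, the Chinese sample string is an assumed rendering of the example sentence, and the category query is reduced to a simple filter.

```python
def normal_segment(text: str) -> list:
    """Stand-in for normal segmentation: a real system would segment semantically."""
    return text.split()


def prefix_segment(text: str) -> list:
    """Prefix segmentation: every prefix of the text, from one character upward."""
    return [text[:end] for end in range(1, len(text) + 1)]


def query_by_category(novels: list, name: str, category: str) -> list:
    """Category-index style query: match on the name and on the category."""
    return [n for n in novels if n["name"] == name and n["category"] == category]


# A six-character sentence yields six prefixes, which suits suggestion drop-downs.
prefixes = prefix_segment("今天天气真好")

novels = [
    {"name": "Three-Body", "category": "science fiction"},
    {"name": "Three-Body", "category": "romance"},
]
hits = query_by_category(novels, "Three-Body", "science fiction")
```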
In a preferred embodiment, a function for retrieval within a specified field (domain) may also be provided. For example, if only books whose title is "Three-Body" are wanted, and books whose author name contains "Three-Body" are not, retrieval within a specified field can be used. The platform adds special characters to the index entry to indicate that it is an index for retrieval within a field. When the user issues a request, the corresponding mark is added to the search, so that documents containing the word "Three-Body" only in other fields are directly eliminated by index intersection.
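One way such field-restricted retrieval could work is sketched below. The choice of the marker character, the key layout `field + marker + word`, and the whitespace tokenization are all assumptions of this sketch; the embodiment only states that special characters mark field-specific index entries and that results are narrowed by intersection.

```python
from collections import defaultdict

FIELD_MARK = "\x1f"  # assumed special character separating field name and word


def build_index(docs: dict) -> dict:
    """Index every word twice: once plainly and once marked with its field name."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field_name, text in fields.items():
            for word in text.lower().split():
                index[word].add(doc_id)
                index[f"{field_name}{FIELD_MARK}{word}"].add(doc_id)
    return index


def search(index: dict, word: str, field_name: str = None) -> list:
    """Look up a word; with a field name, intersect with the field-marked postings."""
    hits = set(index.get(word, ()))
    if field_name is not None:
        hits &= index.get(f"{field_name}{FIELD_MARK}{word}", set())
    return sorted(hits)


docs = {
    1: {"title": "Three-Body", "author": "Some Writer"},
    2: {"title": "Collected Essays", "author": "Three-Body Studio"},
}
index = build_index(docs)
anywhere = search(index, "three-body")
title_only = search(index, "three-body", "title")
```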
As can be seen from the above, in this embodiment, according to the service types of multiple service data, the corresponding configuration file is obtained, then, according to the field preprocessing instruction of the configuration file, the data content of the service data is preprocessed, and according to the word segmentation processing instruction of the configuration file, the preprocessed data content is respectively subjected to word segmentation processing, so as to generate the index file with the unified data format. The invention processes the data by adopting the corresponding configuration files aiming at the data of different service types, then adopts the same program to segment the data content, and normalizes the service data of different formats into the index data of a uniform data format, thereby being capable of uniformly establishing the index file aiming at various service data types, simplifying the establishing process and improving the efficiency.
Fourth embodiment
In order to better implement the index file generation method provided by the embodiment of the present invention, an index file generation apparatus corresponding to the index file generation method is also provided in the embodiment of the present invention. The terms used have the same meanings as in the index file generation method, and specific implementation details can refer to the description in the method embodiments.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an index file generating device according to a fourth embodiment of the present invention. The device is based on a B/S (browser/server) system structure: a user uses the system through a browser, and the system supports generating index data in a uniform data format from data of multiple service types on the same platform.
The device comprises: a first obtaining module 401, a second obtaining module 402, a preprocessing module 403, a word segmentation module 404, and an index generation module 405.
The first obtaining module 401 is configured to obtain service data, where the service data includes data content and a service type; the second obtaining module 402 is configured to obtain a corresponding configuration file according to the service type, where the configuration file includes a field preprocessing instruction and a word segmentation processing instruction.
It is understood that the service types may include: video, music, picture, etc., and correspondingly, the service data may include video data, music data, picture data, etc., and is not limited in particular here. In addition, the data format of the service data in this embodiment may be divided into two parts, where one part carries information indicating a service type, and the other part carries data content corresponding to the service type.
Each service type corresponds to a configuration file, and the configuration file is pre-configured and stored in the index file generation device by a user according to the characteristics of the service type in actual operation.
Further, the configuration file includes an instruction to preprocess the field of the data content and an instruction to perform word segmentation on the field of the data content, and the configuration file is generated according to the configuration of the field of each service data by the user, where the configuration of the field is not particularly limited.
The preprocessing module 403 is configured to preprocess the data content according to the field preprocessing instruction, and generate a preprocessed data content; the word segmentation module 404 is configured to perform word segmentation on the preprocessed data content according to the word segmentation processing instruction; the index generating module 405 is configured to perform rank processing on the data content after the word segmentation processing, and generate an index file in a unified data format.
Each service type corresponds to a configuration file, that is, each service type corresponds to a field preprocessing instruction, and the data content of each service type is preprocessed according to the corresponding field preprocessing instruction, so that the individual differences among service types can be reflected. After the preprocessing, the service data in different formats can be sent to the forward-index processing unit FSU for forward index generation, and normalized forward-index data is obtained to suit data retrieval for multiple service types.
As can be seen from the above, in this embodiment, according to the service types of multiple service data, the corresponding configuration file is obtained, then, according to the field preprocessing instruction of the configuration file, the data content of the service data is preprocessed, and according to the word segmentation processing instruction of the configuration file, the preprocessed data content is respectively subjected to word segmentation processing, so as to generate the index file with the unified data format. The invention processes the data by adopting the corresponding configuration files aiming at the data of different service types, then adopts the same program to segment the data content, and normalizes the service data of different formats into the index data of a uniform data format, thereby being capable of uniformly establishing the index file aiming at various service data types, simplifying the establishing process and improving the efficiency.
Fifth embodiment
Referring to fig. 5, fig. 5 is a schematic structural diagram of an index file generating device according to a fifth embodiment of the present invention. It should be noted that the index file generation apparatus provided by the present invention is based on a B/S (browser/server) system structure: a user uses the system through a browser, and the system supports generating index data in a unified data format from data of multiple service types on the same platform.
Wherein the index file generating device may include: a first obtaining module 501, a second obtaining module 502, a preprocessing module 503, a word segmentation module 504, and an index generation module 505. It should be understood that, in this embodiment, the functions of these functional modules may refer to the relevant descriptions of the first obtaining module 401, the second obtaining module 402, the preprocessing module 403, the word segmentation module 404, and the index generation module 405 in the fourth embodiment, which are not described herein again.
The device further comprises: a configuration file generating module 506, configured to generate configuration files corresponding to different service types before obtaining the service data.
It can be understood that each service type corresponds to a configuration file, wherein the service types may include: video, music, pictures, and the like, and correspondingly, the service data includes video data, music data, and picture data. In this embodiment, the configuration file is pre-configured and stored in the server by the user according to the characteristics of the service type in the actual operation, and the configuration file includes an instruction for preprocessing the field of the data content and an instruction for segmenting the field of the data content.
In a preferred embodiment, the configuration file may be obtained based on:
the configuration file generation module 506 includes: an acquisition unit 5061 and a configuration unit 5062;
the acquiring unit 5061 is configured to acquire field configuration information corresponding to a service type, where the field configuration information indicates preset attribute values of multiple fields, and the fields include a text field, a numerical field, and a classification field.
It can be understood that the service data of the present invention includes data content and service type, the data content includes a plurality of documents, the documents are composed of a plurality of fields, wherein the type of the field can be preset, and the field includes a text field, a numerical field and a classification field. In addition, each field comprises at least one attribute, which can also be called a configuration item, and the value of the attribute is shown in a selection box form for the user to select and configure.
The configuring unit 5062 is configured to configure the attribute values of the fields according to the indication of the configuration information of the fields, so as to obtain a configuration file corresponding to the service type.
Based on this, in a further preferred embodiment, the attribute values of the plurality of fields may be configured based on the following detailed manner; the configuration unit 5062 may include: the device comprises a first configuration subunit, a second configuration subunit, a third configuration subunit and a generation subunit;
the first configuration subunit is configured to configure the attribute value of the attribute of the text field according to the indication of the configuration information of the field to obtain a configured text field, where the attribute of the text field includes one or a combination of more than two of description, data length, a primary key, importance, and a word segmentation mode;
the meaning of each attribute of the text field is simply described as follows:
the description refers to the meaning represented by the field, plays a role in prompting, and has no influence on the search result;
the data length refers to the maximum length of the text of the field. At present, fields are divided into two classes according to whether the maximum length exceeds 256 bytes; a field whose maximum length is larger than 256 bytes is called a long text field, and only one of all the text fields can be configured as a long text field;
the primary key is a field, called doc_id, that is used to uniquely identify a document. This field must be a value that can be converted into a number. Specifically, the value of doc_id is a 64-bit integer; since the values should be uniformly distributed over the uint64_t space, they are preferably generated by hashing or the like. A Hash value is a numerical value obtained by a logical operation on the data content; different documents yield different Hash values, so the Hash value serves as the identity card of each document;
the importance is the importance degree of the text field, and can be divided into important, general and unimportant;
the word segmentation mode is divided into normal word segmentation and prefix word segmentation. The normal word segmentation refers to segmenting words of a text according to natural semantics, and the mode can be selected by default under general conditions; the prefix segmentation is suitable for a scene that a search box prompts a drop-down box.
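As an illustrative sketch of the primary key and data length attributes, a 64-bit doc_id might be derived from a document's unique key by truncating a cryptographic hash, and a text field might be classified as long text by the 256-byte threshold. The use of SHA-256 and the helper names are assumptions; the embodiment only asks for a hash-like value spread uniformly over the uint64_t space.

```python
import hashlib

LONG_TEXT_THRESHOLD = 256  # bytes, per the data length attribute described above


def make_doc_id(unique_key: str) -> int:
    """Map a document's unique key to an unsigned 64-bit integer doc_id."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    # Keep the first 8 bytes so the result fits the uint64_t range.
    return int.from_bytes(digest[:8], byteorder="big")


def is_long_text_field(max_length_bytes: int) -> bool:
    """A text field whose maximum length exceeds 256 bytes is a long text field."""
    return max_length_bytes > LONG_TEXT_THRESHOLD


doc_id = make_doc_id("novel:three-body")
```

Note that a truncated hash makes collisions between different documents very unlikely rather than strictly impossible, so a real system may still want to detect duplicates.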
The second configuration subunit is configured to configure the attribute value of the attribute of the numerical field according to the indication of the configuration information of the field to obtain a configured numerical field, where the attribute of the numerical field includes one or a combination of two or more of description, data type, authority, importance, and a primary key;
the numerical field is applicable to information of a numerical type, such as selling price, download count, etc. The string value in this field must be convertible to a number. The meaning of each attribute of the numerical field is simply described as follows:
the description refers to the meaning represented by the field, plays a role in prompting, and has no influence on the search result;
the data type may be selected from int8, uint8, int16, uint16, int32, uint32, int64, uint64 and float in this embodiment. The user selects a type according to the maximum possible range of the value; if an actual value in the provided data exceeds the configured range, an error is caused;
authority indicates that the field can reflect the authority of the document. For example, for video search, the number of views may be selected as the authority field. Only zero or one numerical field may be designated as the authority field;
the importance is the importance degree of the field, and can be divided into important, general and unimportant;
the primary key is defined the same as the primary key of a text field: it is a field, called doc_id, that is used to uniquely identify a document. The field must be a value that can be converted into a number. Specifically, the value of doc_id is a 64-bit integer; since the values should be uniformly distributed over the uint64_t space, they are preferably generated by hashing or the like.
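The data type and authority constraints on numerical fields could be checked along the following lines. This is a sketch under stated assumptions: the function names and the dictionary form of a field definition are hypothetical, while the integer ranges are the standard ranges of the listed fixed-width types.

```python
# Value ranges of the fixed-width integer types listed for the data type attribute.
INT_RANGES = {}
for bits in (8, 16, 32, 64):
    INT_RANGES[f"int{bits}"] = (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    INT_RANGES[f"uint{bits}"] = (0, (1 << bits) - 1)


def parse_numeric(value: str, data_type: str):
    """Convert a pushed string to a number and reject values outside the configured type."""
    if data_type == "float":
        return float(value)
    low, high = INT_RANGES[data_type]
    number = int(value)
    if not low <= number <= high:
        raise ValueError(f"{number} is outside the {data_type} range [{low}, {high}]")
    return number


def check_single_authority(numeric_fields: list) -> None:
    """Only zero or one numerical field may be designated as the authority field."""
    count = sum(1 for f in numeric_fields if f.get("authority"))
    if count > 1:
        raise ValueError(f"{count} authority fields configured, at most one is allowed")
```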
The third configuration subunit is configured to configure the attributes of the classification field according to the indication of the configuration information of the field, so as to obtain a configured classification field, where the attributes of the classification field include a retrieval designation category; and the generating subunit is configured to generate a configuration file corresponding to the service type according to the configured text field, the configured numerical field and the configured classification field.
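To make the role of the three configuration subunits and the generation subunit more concrete, the following sketch assembles configured text, numerical and classification fields into one configuration file. The dataclass names, attribute names, default values and the JSON output format are all assumptions of this illustration; the embodiment does not fix a file format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TextField:
    name: str
    description: str = ""
    max_length: int = 256
    primary_key: bool = False
    importance: str = "general"   # "important", "general" or "unimportant"
    segmentation: str = "normal"  # "normal" or "prefix"


@dataclass
class NumericField:
    name: str
    description: str = ""
    data_type: str = "int32"
    authority: bool = False
    importance: str = "general"
    primary_key: bool = False


@dataclass
class CategoryField:
    name: str
    searchable_categories: List[str] = field(default_factory=list)


def generate_config(service_type, text_fields, numeric_fields, category_fields) -> str:
    """Generation subunit: combine the configured fields into one configuration file."""
    config = {
        "service_type": service_type,
        "text_fields": [asdict(f) for f in text_fields],
        "numeric_fields": [asdict(f) for f in numeric_fields],
        "category_fields": [asdict(f) for f in category_fields],
    }
    return json.dumps(config, ensure_ascii=False, indent=2)


config_text = generate_config(
    "novel",
    [TextField("title", description="book title")],
    [NumericField("doc_id", data_type="uint64", primary_key=True)],
    [CategoryField("genre", ["science fiction", "romance"])],
)
```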
Further preferably, the preprocessing module 503 may preprocess the data content according to the field preprocessing instruction in the configuration file. The preprocessing mainly includes data cleaning and data rewriting, and the execution order of the two is not limited: data cleaning may be performed first and then data rewriting, or data rewriting may be performed first and then data cleaning, or both may be performed simultaneously. The two do not affect each other, and this example does not limit the present invention.
Based on this, in an implementation manner, the preprocessing module 503 performs data cleaning first and then performs data rewriting, and the preprocessing module 503 may include: a first determining unit 5031, a first processing unit 5032 and a second processing unit 5033;
the first determining unit 5031 is configured to determine whether a junk field exists in the data content;
the first processing unit 5032 is configured to, if a junk field exists, delete the junk field from the data content and determine whether the deleted data content needs to be rewritten; if rewriting is needed, rewrite the deleted data content and use the rewritten data content as the preprocessed data content; if rewriting is not needed, use the deleted data content as the preprocessed data content;
the second processing unit 5033 is configured to, if no junk field exists, determine whether the data content needs to be rewritten; if rewriting is needed, rewrite the data content and use the rewritten data content as the preprocessed data content; if rewriting is not needed, use the data content as the preprocessed data content.
In another possible implementation manner, the preprocessing module 503 performs data rewriting and then performs data cleaning, and the preprocessing module 503 may include: a second determining unit 5034, a third processing unit 5035 and a fourth processing unit 5036;
the second determining unit 5034 is configured to determine whether the data content needs to be rewritten;
the third processing unit 5035 is configured to, if rewriting is needed, rewrite the data content and determine whether a junk field exists in the rewritten data content; if a junk field exists, delete the junk field from the rewritten data content and use the deleted data content as the preprocessed data content; if no junk field exists, use the rewritten data content as the preprocessed data content;
the fourth processing unit 5036 is configured to, if rewriting is not needed, determine whether a junk field exists in the data content; if a junk field exists, delete the junk field from the data content and use the deleted data content as the preprocessed data content; if no junk field exists, use the data content as the preprocessed data content.
Still further, the word segmentation module 504 may include: an attribute information determining unit, configured to analyze the preprocessed data content to determine the attribute information of the data content; and a word segmentation unit, configured to segment the preprocessed data content according to the word segmentation processing instruction and the attribute information to generate word-segmented data content.
In some embodiments, the attribute information determination unit may include: an acquisition subunit, configured to acquire a preset word segmentation template; and a determining subunit, configured to analyze the preprocessed data content according to the word segmentation template and determine the attribute information of the data content. The server is preset with a plurality of word segmentation templates, which may include data templates for a plurality of service types. For music data, for example, the data templates may include a singer database, a song title database, a genre database and the like, and the attribute information of the data content can be obtained by analyzing the data content against these data templates.
After the preprocessing, processing is carried out according to the attribute information and the preset word segmentation processing instruction; that is, a normalization operation is performed in which the service data in different formats is normalized into a unified data format, yielding normalized forward-index data suited to data retrieval for various service types.
It is understood that after preprocessing, the data enters the forward-index processing unit FSU for forward index generation. By configuring the word segmentation processing instruction in the configuration file and processing the data according to several built-in word segmentation templates, the information needed for retrieval, such as word information and word part-of-speech (POS) information, is calculated, and the forward index file with a uniform format is finally output.
It is to be understood that the apparatus may further include: a conversion module 507, configured to convert the forward index file, after it is generated in the unified data format, into an inverted index file, so that a user can search according to the inverted index file.
As can be seen from the above, in this embodiment, according to the service types of multiple service data, the corresponding configuration file is obtained, then, according to the field preprocessing instruction of the configuration file, the data content of the service data is preprocessed, and according to the word segmentation processing instruction of the configuration file, the preprocessed data content is respectively subjected to word segmentation processing, so as to generate the index file with the unified data format. The invention processes the data by adopting the corresponding configuration files aiming at the data of different service types, then adopts the same program to segment the data content, and normalizes the service data of different formats into the index data of a uniform data format, thereby being capable of uniformly establishing the index file aiming at various service data types, simplifying the establishing process and improving the efficiency.
Sixth embodiment
An embodiment of the present invention further provides a server, in which the index file generating apparatus according to the embodiment of the present invention may be integrated, as shown in fig. 6, which shows a schematic structural diagram of the server according to the embodiment of the present invention, specifically:
the server may include components such as a processor 601 of one or more processing cores, memory 602 of one or more computer-readable storage media, Radio Frequency (RF) circuitry 603, a power supply 604, an input unit 605, and a display unit 606. Those skilled in the art will appreciate that the server architecture shown in FIG. 6 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 601 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby performing overall monitoring of the server. Optionally, processor 601 may include one or more processing cores; preferably, the processor 601 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 601.
The memory 602 may be used to store software programs and modules, and the processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the server, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.
RF circuit 603 may be used for receiving and transmitting signals during the process of transmitting and receiving information, and in particular, for receiving downlink information of a base station and then processing the received downlink information by one or more processors 601; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 603 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 603 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The server also includes a power supply 604 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 601 via a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 604 may also include any of the following components: one or more DC or AC power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The server may also include an input unit 605, and the input unit 605 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The server may also include a display unit 606, and the display unit 606 may be used to display information input by the user or provided to the user, as well as various graphical user interfaces of the server, which may be made up of graphics, text, icons, video, and any combination thereof. The display unit 606 may include a display panel, and optionally, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like.
Specifically, in this embodiment, the processor 601 in the server loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application programs stored in the memory 602, thereby implementing various functions as follows:
acquiring service data, wherein the service data comprises data content and a service type; acquiring a corresponding configuration file according to the service type, wherein the configuration file comprises a field preprocessing instruction and a word segmentation processing instruction; preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content; performing word segmentation processing on the preprocessed data content respectively according to the word segmentation processing instruction; and arranging the data content after word segmentation to generate an index file with a uniform data format.
Preferably, the processor 601 is further configured to: configuration files corresponding to different service types are respectively generated.
Further, field configuration information corresponding to the service type is obtained, wherein the field configuration information indicates preset attribute values of a plurality of fields, and the fields comprise text fields, numerical fields and classification fields; and the attribute values of the fields are configured according to the indication of the field configuration information to obtain a configuration file corresponding to the service type.
Preferably, the processor 601 is further configured to: judging whether a junk field exists in the data content;
if the junk field exists, deleting the junk field from the data content, judging whether the deleted data content needs to be rewritten, if so, rewriting the deleted data content, and taking the rewritten data content as the preprocessed data content; if the data content does not need to be rewritten, the deleted data content is used as the preprocessed data content;
if no junk field exists, judging whether the data content needs to be rewritten, if yes, rewriting the data content, and taking the rewritten data content as the preprocessed data content; and if the data content does not need to be rewritten, taking the data content as the preprocessed data content.
Preferably, the processor 601 is further configured to: judging whether the data content needs to be rewritten or not;
if the data content needs to be rewritten, rewriting the data content, and judging whether a junk field exists in the rewritten data content, if so, deleting the junk field from the rewritten data content, taking the deleted data content as the preprocessed data content, and if not, taking the rewritten data content as the preprocessed data content;
if the data content does not need to be rewritten, judging whether a junk field exists in the data content, if so, deleting the junk field from the data content, taking the deleted data content as the preprocessed data content, and if not, taking the data content as the preprocessed data content.
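The two preprocessing orders described above (delete junk fields first and then rewrite, or rewrite first and then delete junk fields) can be sketched with one function and a flag. The representation of data content as a dictionary, the `junk_fields` set, the `rules` mapping of field names to rewrite functions, and the `clean_first` parameter are all assumptions introduced for this example; the patent does not prescribe how junk fields or rewrite rules are encoded.

```python
def remove_junk(content, junk_fields):
    """Data cleaning: drop every field listed as junk."""
    return {k: v for k, v in content.items() if k not in junk_fields}


def rewrite(content, rules):
    """Data rewriting: apply a rule to each field that has one."""
    return {k: (rules[k](v) if k in rules else v) for k, v in content.items()}


def preprocess(content, junk_fields, rules, clean_first=True):
    """Run the 'judge, then act' steps in either order."""
    def has_junk(c):
        return any(k in junk_fields for k in c)

    def needs_rewrite(c):
        return any(k in rules for k in c)

    steps = [
        (has_junk, lambda c: remove_junk(c, junk_fields)),
        (needs_rewrite, lambda c: rewrite(c, rules)),
    ]
    if not clean_first:
        steps.reverse()
    for check, apply_step in steps:
        # Each step is applied only when its judgment is positive.
        if check(content):
            content = apply_step(content)
    return content


# Hypothetical sample data.
junk = {"tracking_id"}
rules = {"title": str.strip}
sample = {"title": " Hello ", "tracking_id": "abc"}
clean_then_rewrite = preprocess(sample, junk, rules, clean_first=True)
rewrite_then_clean = preprocess(sample, junk, rules, clean_first=False)
```

In this sketch both orders yield the same preprocessed content; cleaning first simply means that rewrite rules never run on fields that are about to be deleted.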
Preferably, the processor 601 is further configured to:
configuring the attribute values of the attributes of the text field according to the indication of the field configuration information to obtain the configured text field, wherein the attributes of the text field comprise one or a combination of two or more of description, data length, primary key, importance and word segmentation mode;
configuring the attribute values of the attributes of the numerical field according to the indication of the field configuration information to obtain the configured numerical field, wherein the attributes of the numerical field comprise one or a combination of two or more of description, data type, authority, importance and primary key;
configuring the attributes of the classification field according to the indication of the field configuration information to obtain the configured classification field, wherein the attributes of the classification field comprise a category designated for retrieval;
and generating a configuration file corresponding to the service type according to the configured text field, the configured numerical field and the configured classification field.
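The configuration steps above can be sketched as follows, assuming the configuration file is serialized as JSON and the field configuration information arrives as a dictionary. The attribute names follow the text (description, data length, primary key, importance, word segmentation mode, data type, authority, retrieval category), but their Python types, default values, and the helper `build_config` are hypothetical choices for illustration only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TextField:
    name: str
    description: str = ""
    data_length: int = 0
    primary_key: bool = False
    importance: int = 0
    segmentation_mode: str = "default"


@dataclass
class NumericField:
    name: str
    description: str = ""
    data_type: str = "int"
    authority: str = ""
    importance: int = 0
    primary_key: bool = False


@dataclass
class ClassificationField:
    name: str
    retrieval_category: str = ""


def build_config(service_type, field_info):
    """Apply the preset attribute values in field_info to each field kind."""
    return {
        "service_type": service_type,
        "text_fields": [asdict(TextField(**f))
                        for f in field_info.get("text", [])],
        "numeric_fields": [asdict(NumericField(**f))
                           for f in field_info.get("numeric", [])],
        "classification_fields": [asdict(ClassificationField(**f))
                                  for f in field_info.get("classification", [])],
    }


# Hypothetical field configuration information for one service type.
config = build_config("news", {
    "text": [{"name": "title", "data_length": 128, "importance": 3}],
    "numeric": [{"name": "views", "data_type": "int"}],
    "classification": [{"name": "channel", "retrieval_category": "tech"}],
})
config_text = json.dumps(config, indent=2)
```

Generating one such configuration per service type keeps the type-specific settings in data rather than in code, which is the point of obtaining the configuration file by service type.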
Preferably, the processor 601 is further configured to: analyzing the preprocessed data content to determine attribute information of the data content; and performing word segmentation on the preprocessed data content according to the word segmentation processing instruction and the attribute information to generate data content after word segmentation processing.
Further, the forward index file is converted into an inverted index file, so that a user can search according to the inverted index file.
Preferably, the processor 601 is further configured to: acquiring a preset word segmentation template; and analyzing the preprocessed data content according to the word segmentation template to determine the attribute information of the data content.
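As a rough illustration of template-driven analysis followed by word segmentation, the sketch below assumes a word segmentation template that holds one data template per service type and maps each field to its attribute information. The template layout, the attribute keys `kind` and `mode`, and the regular-expression tokenizer are assumptions; in particular, the `\w+` split is only a stand-in, since Chinese text would need a dictionary-based or statistical segmenter.

```python
import re

# Hypothetical word segmentation template: one data template per service
# type, mapping each field to the attribute information used later.
SEGMENTATION_TEMPLATE = {
    "news": {
        "title": {"kind": "text", "mode": "words"},
        "views": {"kind": "numeric"},
        "channel": {"kind": "classification"},
    },
}

DEFAULT_ATTRIBUTES = {"kind": "text", "mode": "words"}


def analyze(service_type, content, template=SEGMENTATION_TEMPLATE):
    """Determine attribute information for each field in the content."""
    data_template = template[service_type]
    return {name: data_template.get(name, DEFAULT_ATTRIBUTES)
            for name in content}


def segment(content, attributes):
    """Segment text fields; keep numeric and classification values whole."""
    result = {}
    for name, value in content.items():
        if attributes[name]["kind"] == "text":
            result[name] = re.findall(r"\w+", value.lower())
        else:
            result[name] = [value]
    return result


# Hypothetical preprocessed data content.
content = {"title": "Index File Generation", "views": 42, "channel": "tech"}
attributes = analyze("news", content)
tokens = segment(content, attributes)
```

Keeping the attribute lookup separate from the tokenizer lets the same segmentation routine serve every service type, with only the template entry changing.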
It can be understood that, in the foregoing embodiments, descriptions of the respective embodiments have respective emphasis, and a part which is not described in detail in a certain embodiment may refer to the above detailed description of the corresponding index file generation method, which is not described herein again.
As can be seen from the above, the server provided in this embodiment obtains a corresponding configuration file according to the service type of each of multiple pieces of service data, then preprocesses the data content of the service data according to the field preprocessing instruction of the configuration file, and performs word segmentation on the preprocessed data content according to the word segmentation processing instruction of the configuration file, so as to generate an index file with a unified data format. The invention thus processes data of different service types with their corresponding configuration files, segments the data content with the same program, and normalizes service data of different formats into index data of a unified data format. Index files can therefore be built uniformly for various service data types, which simplifies the building process and improves efficiency.
The index file generation device provided in the embodiment of the present invention is, for example, a computer, a tablet computer, a mobile phone with a touch function, or the like. The index file generation device and the index file generation method in the above embodiments belong to the same concept, and any method provided in the embodiments of the index file generation method may be run on the index file generation device. The specific implementation process is described in detail in the embodiments of the index file generation method and is not repeated here.
It should be noted that, for the index file generation method of the present invention, it can be understood by a person skilled in the art that all or part of the processes for implementing the index file generation method of the embodiments of the present invention can be completed by controlling the related hardware through a computer program, where the computer program can be stored in a computer-readable storage medium, such as a memory of a terminal, and executed by at least one processor in the terminal, and during the execution process, the processes of the embodiments of the index file generation method can be included. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
For the index file generation apparatus according to the embodiment of the present invention, each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, or the like.
The method and the apparatus for generating an index file according to the embodiments of the present invention are described in detail above, and a specific example is applied in the description to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understand the method and the core idea of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (12)

1. An index file generation method, characterized in that the method comprises:
acquiring field configuration information corresponding to a service type, wherein the field configuration information indicates preset attribute values of a plurality of fields, and the fields comprise text fields, numerical fields and classification fields;
configuring the attribute values of the fields according to the indication of the configuration information of the fields to obtain configuration files corresponding to the service types so as to respectively generate the configuration files corresponding to a plurality of different service types;
acquiring service data, wherein the service data comprises data content and a service type;
acquiring a corresponding configuration file according to the service types, wherein the configuration file comprises a field preprocessing instruction and a word segmentation processing instruction, and each service type corresponds to one field preprocessing instruction;
for each service type, preprocessing the data content according to a corresponding field preprocessing instruction to generate preprocessed data content so as to embody personalized difference operation among different service types, wherein the preprocessing of the data content comprises data cleaning and data rewriting;
analyzing the preprocessed data content according to a preset word segmentation template to determine attribute information of the data content, wherein the preset word segmentation template comprises data templates of different service types;
and according to the word segmentation processing instruction and the attribute information, performing word segmentation processing on the preprocessed data content respectively, performing forward arrangement processing on the data content after word segmentation processing, and generating a forward index file with a uniform data format, so as to perform a normalization operation on data of different service types, wherein the data format is divided into two parts, one part bears information indicating the service type, and the other part bears the data content corresponding to the service type.
2. The method for generating an index file according to claim 1, wherein the preprocessing the data content according to the corresponding field preprocessing instruction to generate the preprocessed data content comprises:
judging whether a junk field exists in the data content;
if the junk field exists, deleting the junk field from the data content, judging whether the deleted data content needs to be rewritten, if so, rewriting the deleted data content, and taking the rewritten data content as the preprocessed data content;
if the data content does not need to be rewritten, the deleted data content is used as the preprocessed data content;
if no junk field exists, judging whether the data content needs to be rewritten, if yes, rewriting the data content, and taking the rewritten data content as the preprocessed data content; and if the data content does not need to be rewritten, taking the data content as the preprocessed data content.
3. The method for generating an index file according to claim 1, wherein the preprocessing the data content according to the corresponding field preprocessing instruction to generate the preprocessed data content comprises:
judging whether the data content needs to be rewritten or not;
if the data content needs to be rewritten, rewriting the data content, and judging whether a junk field exists in the rewritten data content, if so, deleting the junk field from the rewritten data content, taking the deleted data content as the preprocessed data content, and if not, taking the rewritten data content as the preprocessed data content;
if the data content does not need to be rewritten, judging whether a junk field exists in the data content, if so, deleting the junk field from the data content, taking the deleted data content as the preprocessed data content, and if not, taking the data content as the preprocessed data content.
4. The method according to claim 1, wherein configuring the attribute values of the fields according to the indication of the configuration information of the fields to obtain the configuration file corresponding to the service type comprises:
configuring the attribute values of the attributes of the text field according to the indication of the field configuration information to obtain the configured text field, wherein the attributes of the text field comprise one or a combination of two or more of description, data length, primary key, importance and word segmentation mode;
configuring the attribute values of the attributes of the numerical field according to the indication of the field configuration information to obtain the configured numerical field, wherein the attributes of the numerical field comprise one or a combination of two or more of description, data type, authority, importance and primary key;
configuring the attributes of the classification field according to the indication of the field configuration information to obtain the configured classification field, wherein the attributes of the classification field comprise a category designated for retrieval;
and generating a configuration file corresponding to the service type according to the configured text field, the configured numerical field and the configured classification field.
5. The method for generating an index file according to claim 1, wherein after the forward arrangement processing is performed on the data content after word segmentation processing to generate the forward index file with a uniform data format, the method further comprises:
and converting the forward index file into an inverted index file so that a user can search according to the inverted index file.
6. An index file generation apparatus, comprising:
the configuration file generation module comprises an acquisition unit and a configuration unit;
the acquiring unit is configured to acquire field configuration information corresponding to a service type, where the field configuration information indicates preset attribute values of multiple fields, and the fields include a text field, a numerical field, and a classification field;
the configuration unit is configured to configure the attribute values of the fields according to the indication of the configuration information of the fields to obtain configuration files corresponding to the service types, so as to generate configuration files corresponding to a plurality of different service types respectively;
the first acquisition module is used for acquiring service data, and the service data comprises data content and a service type;
a second obtaining module, configured to obtain a corresponding configuration file according to the service type, where the configuration file includes a field preprocessing instruction and a word segmentation processing instruction, and each service type corresponds to a field preprocessing instruction;
the preprocessing module is used for preprocessing the data content according to the corresponding field preprocessing instruction aiming at each service type to generate the preprocessed data content so as to reflect the personalized difference operation among different service types, wherein the preprocessing of the data content comprises data cleaning and data rewriting;
the word segmentation module comprises an attribute information determination unit and a word segmentation unit;
the attribute information determining unit is used for analyzing the preprocessed data content according to a preset word segmentation template to determine attribute information of the data content, wherein the preset word segmentation template comprises data templates of different service types;
the word segmentation unit is used for segmenting the preprocessed data content according to the word segmentation processing instruction and the attribute information;
and the index generation module is used for performing forward arrangement processing on the data content after the word segmentation processing to generate a forward index file with a uniform data format, so as to perform a normalization operation on data of different service types, wherein the data format is divided into two parts, one part bears information indicating the service type, and the other part bears the data content corresponding to the service type.
7. The index file generation apparatus of claim 6, wherein the preprocessing module comprises:
the first judging unit is used for judging whether a junk field exists in the data content;
a first processing unit, configured to delete a garbage field from the data content if the garbage field exists, and determine whether the deleted data content needs to be rewritten, and if the deleted data content needs to be rewritten, rewrite the deleted data content, and use the rewritten data content as a preprocessed data content; if the data content does not need to be rewritten, the deleted data content is used as the preprocessed data content;
the second processing unit is used for judging whether the data content needs to be rewritten if no junk field exists, and rewriting the data content if the data content needs to be rewritten, wherein the rewritten data content is used as the preprocessed data content; and if the data content does not need to be rewritten, taking the data content as the preprocessed data content.
8. The index file generation apparatus of claim 6, wherein the preprocessing module comprises:
a second judging unit configured to judge whether the data content needs to be rewritten;
a third processing unit, configured to rewrite the data content if rewriting is required, and determine whether a garbage field exists in the rewritten data content, delete the garbage field from the rewritten data content if a garbage field exists, and use the deleted data content as a preprocessed data content, and if a garbage field does not exist, use the rewritten data content as a preprocessed data content;
and the fourth processing unit is used for judging whether a junk field exists in the data content if the data content does not need to be rewritten, deleting the junk field from the data content if a junk field exists, taking the deleted data content as the preprocessed data content, and taking the data content as the preprocessed data content if no junk field exists.
9. The index file generation apparatus according to claim 6, wherein the configuration unit includes:
the first configuration subunit is configured to configure the attribute values of the attributes of the text field according to the indication of the field configuration information to obtain a configured text field, where the attributes of the text field include one or a combination of two or more of description, data length, a primary key, importance, and a word segmentation mode;
the second configuration subunit is configured to configure the attribute values of the attributes of the numerical field according to the indication of the field configuration information to obtain a configured numerical field, where the attributes of the numerical field include one or a combination of two or more of description, data type, authority, importance, and a primary key;
a third configuration subunit, configured to configure, according to the indication of the field configuration information, the attributes of the classification field to obtain a configured classification field, where the attributes of the classification field include a category designated for retrieval;
and the generating subunit is used for generating a configuration file corresponding to the service type according to the configured text field, the configured numerical value field and the configured classification field.
10. The index file generating apparatus according to claim 6, wherein the apparatus further comprises:
and the conversion module is used for converting the forward index file into an inverted index file so that a user can search according to the inverted index file.
11. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to perform the method according to any one of claims 1 to 5.
12. A server comprising a processor and a memory, the memory storing a computer program for index file generation, wherein the computer program loaded into the memory by the processor performs the method of any one of claims 1 to 5.
CN201510039519.5A 2015-01-27 2015-01-27 Index file generation method and device Active CN105988996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510039519.5A CN105988996B (en) 2015-01-27 2015-01-27 Index file generation method and device

Publications (2)

Publication Number Publication Date
CN105988996A CN105988996A (en) 2016-10-05
CN105988996B true CN105988996B (en) 2020-04-10

Family

ID=57034424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510039519.5A Active CN105988996B (en) 2015-01-27 2015-01-27 Index file generation method and device

Country Status (1)

Country Link
CN (1) CN105988996B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241713B (en) * 2016-12-27 2021-12-28 南京烽火星空通信发展有限公司 Inverted index retrieval method based on multi-element segmentation
CN107179953B (en) * 2017-03-31 2020-04-03 北京奇艺世纪科技有限公司 Index file generation method, device and system
CN107256206B (en) * 2017-05-24 2021-04-30 北京京东尚科信息技术有限公司 Method and device for converting character stream format
CN109327321B (en) * 2017-08-01 2021-10-15 中兴通讯股份有限公司 Network model service execution method and device, SDN controller and readable storage medium
CN108062297B (en) * 2017-11-22 2021-06-15 深圳市亿图软件有限公司 PDF file text field creating method and device and terminal equipment
CN109241098B (en) * 2018-08-08 2022-02-18 南京中新赛克科技有限责任公司 Query optimization method for distributed database
CN109783444A (en) * 2018-12-26 2019-05-21 亚信科技(中国)有限公司 Multichannel file index method, device, computer equipment and storage medium
CN110427368B (en) * 2019-07-12 2022-07-12 深圳绿米联创科技有限公司 Data processing method and device, electronic equipment and storage medium
CN110489417B (en) * 2019-07-25 2023-03-28 深圳壹账通智能科技有限公司 Data processing method and related equipment
CN110990126A (en) * 2019-12-12 2020-04-10 北京明略软件***有限公司 Method and device for realizing shortcut front-end service page based on js
CN113468393A (en) * 2021-06-09 2021-10-01 北京达佳互联信息技术有限公司 Index generation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567418A (en) * 2010-12-23 2012-07-11 北大方正集团有限公司 Methods and devices for integrating and searching data
CN104199977A (en) * 2014-09-24 2014-12-10 浪潮软件股份有限公司 Method for creating information search based on data in database

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032703A1 (en) * 2008-05-30 2014-01-30 Matthew A. Wormley System and method for an expandable computer storage system
CN103823799A (en) * 2012-11-16 2014-05-28 镇江诺尼基智能技术有限公司 New-generation industry knowledge full-text search method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
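Research on Enterprise Unstructured Data Retrieval (企业非结构化数据检索研究); Xu Shuzhen (徐树振) et al.; Information Technology (《信息技术》); 2014-04-25 (No. 4); pp. 196-200 *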

Also Published As

Publication number Publication date
CN105988996A (en) 2016-10-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant