CN112825078A - Information processing method and device - Google Patents

Information processing method and device

Info

Publication number
CN112825078A
Authority
CN
China
Prior art keywords
word
text information
processed
vectors
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911148659.0A
Other languages
Chinese (zh)
Inventor
李银锋
黄明星
周彬
刘婷婷
黄建杰
赖晨东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201911148659.0A priority Critical patent/CN112825078A/en
Publication of CN112825078A publication Critical patent/CN112825078A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282Rating or review of business operators or products

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information processing method and device, and relates to the field of computer technology. One specific implementation of the method comprises: obtaining text information to be processed and performing word segmentation processing to extract M keywords; inputting the M keywords into a trained word vector model to obtain M word vectors, and clustering the M word vectors to generate N synonym sets; and converting the text information to be processed into N-dimensional vectors based on the N synonym sets, and clustering the N-dimensional vectors to obtain a classification result of the text information to be processed. The method and device can therefore solve the problems that classification of article quality problems currently depends mainly on manual labor, which is costly and inefficient.

Description

Information processing method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an information processing method and apparatus.
Background
As living standards have risen, people's quality requirements for articles have become higher and higher. To ensure that high-quality articles are provided, dedicated quality control personnel are required to strictly control the quality of the articles. The quality control personnel first collect the quality problems of the articles fed back by consumers, then summarize and classify these problems and feed them back to the producer, and finally the producer improves the articles according to the specific problems. At present, consumer behavior is shifting from offline to online, and complaints about the quality of articles are also moving online, for example appearing in product reviews. Nowadays, quality control personnel mainly summarize and classify article quality problems based on such feedback texts.
In the process of implementing the invention, the inventors found that the prior art has at least the following problems:
at present, classification of article quality problems depends mainly on quality control personnel, who browse large amounts of text such as customer comments and complaints and summarize them manually. On the one hand, this approach requires a significant amount of manpower. On the other hand, a person's energy is limited and the amount of text that can be browsed each day is limited, resulting in low efficiency.
Disclosure of Invention
In view of this, embodiments of the present invention provide an information processing method and apparatus, which can solve the problems of high cost and low efficiency caused by the fact that classification of article quality problems currently depends mainly on manual labor.
In order to achieve the above object, according to an aspect of the embodiments of the present invention, there is provided an information processing method, including: obtaining text information to be processed and performing word segmentation processing to extract M keywords; inputting the M keywords into a trained word vector model to obtain M word vectors, and clustering the M word vectors to generate N synonym sets; and converting the text information to be processed into N-dimensional vectors based on the N synonym sets, and clustering the N-dimensional vectors to obtain a classification result of the text information to be processed.
Optionally, acquiring text information to be processed to perform word segmentation processing, including:
acquiring the text information to be processed, and deleting content in the text information that belongs to preset character types, thereby completing the preprocessing of the text to be processed;
determining a word segmentation tool, and performing word segmentation processing on the preprocessed text information.
Optionally, extracting M keywords includes:
calculating an importance value for each segmented word through a tf-idf algorithm, and sorting the segmented words in descending order of importance value;
and extracting the top M segmented words as keywords.
Optionally, the word vector model is a word2vec model; the dimension of the output word feature vectors is 100, and the maximum distance between the current word and the predicted word is 4.
Optionally, clustering the M word vectors to generate N synonym sets includes:
aggregating the M word vectors into N classes by using a K-means clustering algorithm, thereby generating N synonym sets.
Optionally, converting the text information to be processed into an N-dimensional vector based on the N synonym sets includes:
judging whether the text information to be processed contains any word from the i-th synonym set N_i; if so, the i-th dimension of the N-dimensional vector is set to 1, otherwise it is set to 0; and repeating the above process until the text information to be processed is encoded into an N-dimensional vector.
Optionally, the method further comprises:
clustering the N-dimensional vectors based on a density clustering algorithm to obtain a classification result of the text information to be processed;
and evaluating the classification result by calculating cosine similarity.
In addition, according to an aspect of the embodiments of the present invention, there is provided an information processing apparatus, including an obtaining module, configured to obtain text information to be processed and perform word segmentation processing to extract M keywords; a generating module, configured to input the M keywords into the trained word vector model to obtain M word vectors and cluster the M word vectors to generate N synonym sets; and a processing module, configured to convert the text information to be processed into N-dimensional vectors based on the N synonym sets, and to cluster the N-dimensional vectors to obtain a classification result of the text information to be processed.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments described above.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements any of the methods described above.
One embodiment of the above invention has the following advantages or beneficial effects: text information to be processed is acquired and subjected to word segmentation to extract M keywords; the M keywords are input into a trained word vector model to obtain M word vectors, and the M word vectors are clustered to generate N synonym sets; the text information to be processed is then converted into N-dimensional vectors based on the N synonym sets, and the N-dimensional vectors are clustered to obtain a classification result of the text information to be processed. The invention can therefore solve the problems of low efficiency and excessive human resource input in classifying article quality problems.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of an information processing method according to a first embodiment of the present invention;
Fig. 2 is a schematic diagram of a main flow of an information processing method according to a second embodiment of the present invention;
fig. 3 is a schematic diagram of main blocks of an information processing apparatus according to an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of an information processing method according to a first embodiment of the present invention, which may include:
step S101, acquiring text information to be processed, and performing word segmentation processing to extract M keywords.
In some embodiments, the text information to be processed is acquired, content belonging to preset character types is deleted from the text information, and the preprocessing of the text to be processed is thereby completed. A word segmentation tool is then determined to segment the preprocessed text information. That is, the obtained text information to be processed is first preprocessed, for example by deleting special characters, emoticons, and the like. By preprocessing the text information to be processed, this embodiment can remove much of the useless content, which benefits the efficiency and accuracy of subsequent processing.
Further, Chinese word segmentation tools such as jieba may be used. In this embodiment, the preprocessed text is segmented using jieba.
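As a rough illustration of this step, the following Python sketch (a minimal example assuming the jieba library; the regular expression merely stands in for the "preset character types", which the patent does not enumerate) removes special characters and emoticons and then segments the cleaned text:

    import re
    import jieba

    def preprocess(text):
        # Keep only Chinese characters, letters and digits; the exact character
        # types to delete are not specified in the patent, so this rule is only
        # an assumed example.
        return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", text)

    def segment(text):
        # Word segmentation with jieba; any comparable Chinese segmenter works
        return jieba.lcut(preprocess(text))

    tokens = segment("这件衣服掉色很严重！！")  # e.g. ['这件', '衣服', '掉色', '很', '严重']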
It is also worth mentioning that an importance value is calculated for each segmented word through the tf-idf algorithm, and the segmented words are sorted in descending order of importance value. The top M segmented words are then extracted as keywords. The tf-idf algorithm is a commonly used weighting technique in information retrieval and data mining.
Step S102, inputting M keywords into the trained word vector model to obtain M word vectors, and clustering the M word vectors to generate N synonym sets.
In some embodiments, the word vector model is a word2vec model, where the dimension of the output word feature vectors is set to 100 and the maximum distance (window) between the current word and the predicted word is set to 4. This embodiment can thus represent the keywords as vectors, which can then be clustered.
Further, the word2vec model is trained through the word2vec tool integrated in the Gensim open-source package. Gensim is an open-source Python library that supports various model algorithms, including word2vec. word2vec is the model used to generate the word vectors.
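A minimal training sketch with Gensim might look as follows; it assumes gensim 4.x (where the dimension parameter is named vector_size; older versions use size), and the segmented corpus and keyword list are placeholder data:

    from gensim.models import Word2Vec

    # sentences: the segmented corpus, one token list per text (placeholder data)
    sentences = [["衣服", "掉色", "严重"], ["屏幕", "碎", "了"], ["掉色", "褪色"]]

    # Dimension of the word feature vectors is 100 and the maximum window
    # between the current word and the predicted word is 4, as described above.
    model = Word2Vec(sentences, vector_size=100, window=4, min_count=1)

    keywords = ["掉色", "屏幕"]                      # the M extracted keywords (placeholder)
    word_vectors = [model.wv[w] for w in keywords]   # the M word vectors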
In addition, when clustering is performed on the M word vectors to generate N synonym sets, the M word vectors are aggregated into N classes through a K-means clustering algorithm, and then the N synonym sets are generated.
Preferably, the M word vectors are grouped into N classes by the KMeans algorithm, so that similar words are gathered together to form N synonym sets. That is, near-synonymous words, which describe similar article quality problems, are grouped together. Further, the KMeans method in the scikit-learn open-source package is used to train the KMeans model, and the initial cluster centers are set using the KMeans++ method. scikit-learn is an open-source Python library that supports various model algorithms, including KMeans; KMeans++ is a method for initializing the KMeans cluster centers. Through the above process, the M word vectors can be grouped into N classes to form N synonym sets, thereby realizing the normalization of synonyms.
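A sketch of this clustering step with scikit-learn is shown below; N is a placeholder value, and keywords and word_vectors are carried over from the previous sketch:

    import numpy as np
    from sklearn.cluster import KMeans

    N = 2                        # number of synonym sets (placeholder)
    X = np.array(word_vectors)   # the M word vectors from the word2vec model

    # k-means++ initialization, as described above
    km = KMeans(n_clusters=N, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)

    # Keywords whose vectors fall into the same cluster form one synonym set
    synonym_sets = [
        {kw for kw, lbl in zip(keywords, labels) if lbl == i} for i in range(N)
    ]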
Step S103, based on the N synonym sets, converting the text information to be processed into N-dimensional vectors, and clustering the N-dimensional vectors to obtain a classification result of the text information to be processed.
As another embodiment, it is judged whether the text information to be processed contains any word from the i-th synonym set N_i; if so, the i-th dimension of the N-dimensional vector is set to 1, otherwise it is set to 0. This process is repeated until the text information to be processed is encoded into an N-dimensional vector.
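Continuing the earlier sketches, the encoding described above reduces to checking, for each synonym set N_i, whether the segmented text contains any word of that set:

    def encode(segmented_text, synonym_sets):
        # i-th dimension is 1 if the text contains a word of synonym set N_i, else 0
        tokens = set(segmented_text)
        return [1 if tokens & s else 0 for s in synonym_sets]

    vector = encode(segment("这件衣服掉色了"), synonym_sets)  # an N-dimensional 0/1 vector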
In addition, the N-dimensional vectors can be clustered based on a density clustering algorithm to obtain a classification result of the text information to be processed. And evaluating the classification result by calculating cosine similarity.
Further, the N-dimensional vectors are clustered through DBSCAN, finally yielding the classification result for the article quality problems. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
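A minimal DBSCAN sketch with scikit-learn, continuing the earlier sketches, is given below; eps and min_samples are assumed values, since the patent does not state them:

    import numpy as np
    from sklearn.cluster import DBSCAN

    texts = ["衣服掉色很严重", "掉色褪色厉害", "屏幕碎了"]   # texts to classify (placeholder)
    doc_vectors = np.array([encode(segment(t), synonym_sets) for t in texts])

    # eps and min_samples are illustrative; cluster label -1 marks noise points
    db = DBSCAN(eps=0.5, min_samples=2)
    classes = db.fit_predict(doc_vectors)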
It should be noted that cosine similarity is selected as the standard of similarity measurement, and the cosine similarity is calculated as follows:
cos(A, B) = (A · B) / (|A| × |B|)
where A and B are keyword-based text vectors.
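For evaluating whether texts assigned to the same class are indeed similar, the cosine similarity can be computed directly, for example:

    import numpy as np

    def cosine_similarity(a, b):
        # cos(A, B) = A.B / (|A| * |B|); returns 0.0 for an all-zero vector
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    print(cosine_similarity([1, 0, 1], [1, 1, 1]))  # about 0.816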
Therefore, the information processing method provided by the invention extracts keywords from the text data, obtains synonym sets through a clustering algorithm, encodes the text data into vectors using these synonym sets, and then applies a clustering algorithm again to the encoded vectors to obtain the final multi-class classification result for article quality problems. Moreover, the clustering of the text data focuses on the keyword dimensions: since the text vectors are built from keywords only, non-keywords in the text do not influence the clustering result, which strengthens the robustness of the text clustering. In addition, the invention gathers near-synonymous words into synonym sets through a clustering algorithm, realizing the normalization of synonyms, which increases the chance that texts describing similar quality problems are clustered together and improves the clustering effect.
Fig. 2 is a schematic diagram of a main flow of an information processing method according to a second embodiment of the present invention, which may include:
step S201, acquiring text information to be processed, deleting the content which belongs to the preset character type and exists in the text information, and further completing the preprocessing of the text to be processed.
Step S202, determining a word segmentation tool to perform word segmentation processing on the preprocessed text information.
Step S203, calculating an importance value for each segmented word through the tf-idf algorithm, and sorting the segmented words in descending order of importance value.
Step S204, extracting the top M segmented words as keywords.
Step S205, inputting the M keywords into the trained word vector model to obtain M word vectors.
In an embodiment, the word vector model is a word2vec model, where the dimension of the output word feature vectors is set to 100 and the maximum distance (window) between the current word and the predicted word is set to 4.
Step S206, aggregating the M word vectors into N classes through a K-means clustering algorithm, thereby generating N synonym sets.
Step S207, judging whether the text information to be processed contains any word from the synonym set N_i; if so, performing step S208, otherwise performing step S209.
In step S208, the i-th dimension of the N-dimensional vector is set to 1, and step S210 is performed.
In step S209, the i-th dimension of the N-dimensional vector is set to 0, and step S210 is performed.
Step S210, determining whether the text information to be processed has been encoded into a complete N-dimensional vector based on the synonym sets; if so, performing step S211, otherwise returning to step S207.
Step S211, clustering the N-dimensional vectors based on a density clustering algorithm to obtain the classification result of the text information to be processed.
Step S212, the classification result is evaluated by calculating cosine similarity.
As another embodiment of the present invention, before performing the above steps S201 to S212, a large amount of text information reflecting article quality problems may be collected; this text information mainly originates from review data on the articles and from dialogue data exchanged with customer service through the communication window of an online trading platform.
The collected text information can be preprocessed and then segmented, and the segmented text information is input into a word2vec model for training, so that a trained word2vec model is obtained.
It should be noted that the collected text information needs to be grouped according to article category, so that all text information in each group is merged into one piece of text, that is, treated as one file, from which a corpus can be formed. For example, all text information related to the clothing category is merged into one corresponding file F1, all text information related to the household appliance category is merged into another file F2, and the files of all other categories are constructed by analogy, thereby completing the construction of the corpus F.
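A sketch of this grouping step is shown below; the category labels and the in-memory representation are assumptions, since the patent only requires that all texts of one article category be merged into one file of the corpus:

    from collections import defaultdict

    # (category, text) pairs collected from reviews and customer-service dialogues (placeholder data)
    records = [("clothing", "衣服掉色"), ("clothing", "洗后缩水"), ("appliance", "屏幕碎了")]

    grouped = defaultdict(list)
    for category, text in records:
        grouped[category].append(text)

    # Each category becomes one "file" of the corpus F, e.g. F1 = clothing, F2 = appliance
    corpus_files = {cat: " ".join(txts) for cat, txts in grouped.items()}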
In step S203, an importance value may be calculated for each segmented word through the tf-idf algorithm based on the constructed corpus, so that the segmented words can be sorted in descending order of importance value. The specific implementation process is as follows:
when the tf-idf is adopted to extract the keywords, the tf-idf formula is as follows:
tf-idf = tf * idf
where tf is the term frequency of the keyword in the current file, and idf is the inverse document frequency of the keyword.
tf is calculated as follows:
tf_ij = n_ij / Σ_k n_kj
where n_ij is the number of times the keyword appears in the current file, and Σ_k n_kj is the total number of words in the current file.
The idf is calculated as follows:
idf_i = log(D / D_i)
where D is the total number of files in the corpus, and D_i is the number of files containing the keyword.
Finally, the segmented words are sorted by tf-idf value, and the words corresponding to the top C tf-idf values are taken as the keywords of the article category.
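A direct implementation of these formulas might look like the following sketch, where each document is a list of segmented words (one per corpus file) and C is the number of keywords to keep; any idf smoothing is an implementation choice not specified in the patent:

    import math
    from collections import Counter

    def tf_idf_keywords(documents, doc_index, C):
        # documents: list of token lists, one per corpus file
        doc = documents[doc_index]
        counts = Counter(doc)
        total = len(doc)        # total word count of the current file (sum_k n_kj)
        D = len(documents)      # total number of files in the corpus

        scores = {}
        for word, n_ij in counts.items():
            tf = n_ij / total
            D_i = sum(1 for d in documents if word in d)   # files containing the keyword
            idf = math.log(D / D_i)
            scores[word] = tf * idf

        # sort by tf-idf value in descending order and keep the top C words
        return [w for w, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:C]]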
Fig. 3 is a schematic diagram of main blocks of an information processing apparatus according to an embodiment of the present invention, and as shown in fig. 3, the information processing apparatus 300 includes an acquisition module 301, a generation module 302, and a processing module 303. The obtaining module 301 obtains text information to be processed, and performs word segmentation processing to extract M keywords. The generating module 302 inputs M keywords into the trained word vector model to obtain M word vectors, so as to cluster the M word vectors to generate N synonym sets. The processing module 303 converts the text information to be processed into N-dimensional vectors based on the N synonym sets, and clusters the N-dimensional vectors to obtain a classification result of the text information to be processed.
In some embodiments, when performing word segmentation, the obtaining module 301 acquires the text information to be processed and deletes content in the text information that belongs to preset character types, thereby completing the preprocessing of the text to be processed; a word segmentation tool is then determined to segment the preprocessed text information.
In some embodiments, when extracting the M keywords, the obtaining module 301 calculates an importance value for each segmented word through the tf-idf algorithm, sorts the segmented words in descending order of importance value, and extracts the top M segmented words as keywords.
As another embodiment, the word vector model is a word2vec model. The dimension of the output word feature vector is 100, and the maximum distance between the current word and the predicted word is 4.
It should be further noted that, when the generating module 302 clusters the M word vectors to generate N synonym sets, the M word vectors may be aggregated into N classes by using a K-means clustering algorithm, so as to generate N synonym sets.
It should be further noted that, when converting the text information to be processed into an N-dimensional vector based on the N synonym sets, the processing module 303 may judge whether the text information to be processed contains any word from the i-th synonym set N_i; if so, the i-th dimension of the N-dimensional vector is set to 1, otherwise it is set to 0; this process is repeated until the text information to be processed is encoded into an N-dimensional vector.
In another embodiment, the processing module 303 clusters the N-dimensional vectors based on a density clustering algorithm to obtain a classification result of the text information to be processed; and evaluating the classification result by calculating cosine similarity.
It should be noted that the information processing method and the information processing apparatus of the present invention correspond to each other in their specific implementation details, so the repeated description is omitted.
Fig. 4 shows an exemplary system architecture 400 to which the information processing method or the information processing apparatus of the embodiment of the present invention can be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having information processing screens and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 401, 402, 403. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the information processing method provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the information processing apparatus is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM503, various programs and data necessary for the operation of the system 500 are also stored. The CPU501, ROM502, and RAM503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN card, a modem, or the like. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read therefrom is installed into the storage portion 508 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes an acquisition module, a generation module, and a processing module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: acquire text information to be processed and perform word segmentation processing to extract M keywords; input the M keywords into a trained word vector model to obtain M word vectors, and cluster the M word vectors to generate N synonym sets; and convert the text information to be processed into N-dimensional vectors based on the N synonym sets, and cluster the N-dimensional vectors to obtain a classification result of the text information to be processed.
According to the technical solution of the embodiments of the present invention, the problems that classification of article quality problems currently depends mainly on manual labor, with high cost and low efficiency, can be solved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An information processing method characterized by comprising:
acquiring text information to be processed, and performing word segmentation processing to extract M keywords;
inputting the M keywords into a trained word vector model to obtain M word vectors, and clustering the M word vectors to generate N synonym sets;
and converting the text information to be processed into N-dimensional vectors based on the N synonym sets, and clustering the N-dimensional vectors to obtain a classification result of the text information to be processed.
2. The method of claim 1, wherein obtaining text information to be processed for word segmentation comprises:
acquiring the text information to be processed, and deleting content in the text information that belongs to preset character types, thereby completing the preprocessing of the text to be processed;
determining a word segmentation tool, and performing word segmentation processing on the preprocessed text information.
3. The method of claim 1, wherein extracting M keywords comprises:
calculating an importance value for each segmented word through a tf-idf algorithm, and sorting the segmented words in descending order of importance value;
and extracting the top M segmented words as keywords.
4. The method of claim 1, comprising:
the word vector model is a word2vec model; the dimension of the output word feature vector is 100, and the maximum distance between the current word and the predicted word is 4.
5. The method of claim 1, wherein clustering the M word vectors to generate N synonym sets comprises:
aggregating the M word vectors into N classes by using a K-means clustering algorithm, thereby generating N synonym sets.
6. The method of claim 1, wherein converting the text information to be processed into an N-dimensional vector based on the N synonym sets comprises:
judging whether the text information to be processed contains any word from the i-th synonym set N_i; if so, the i-th dimension of the N-dimensional vector is set to 1, otherwise it is set to 0; and repeating the process until the text information to be processed is encoded into an N-dimensional vector.
7. The method of any of claims 1-6, further comprising:
clustering the N-dimensional vectors based on a density clustering algorithm to obtain a classification result of the text information to be processed;
and evaluating the classification result by calculating cosine similarity.
8. An information processing apparatus, characterized by comprising:
an acquisition module, configured to acquire text information to be processed and perform word segmentation processing to extract M keywords;
a generating module, configured to input the M keywords into a trained word vector model to obtain M word vectors, and to cluster the M word vectors to generate N synonym sets;
and a processing module, configured to convert the text information to be processed into N-dimensional vectors based on the N synonym sets, and to cluster the N-dimensional vectors to obtain a classification result of the text information to be processed.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201911148659.0A 2019-11-21 2019-11-21 Information processing method and device Pending CN112825078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911148659.0A CN112825078A (en) 2019-11-21 2019-11-21 Information processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911148659.0A CN112825078A (en) 2019-11-21 2019-11-21 Information processing method and device

Publications (1)

Publication Number Publication Date
CN112825078A true CN112825078A (en) 2021-05-21

Family

ID=75907420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911148659.0A Pending CN112825078A (en) 2019-11-21 2019-11-21 Information processing method and device

Country Status (1)

Country Link
CN (1) CN112825078A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5276766A (en) * 1991-07-16 1994-01-04 International Business Machines Corporation Fast algorithm for deriving acoustic prototypes for automatic speech recognition
CN106776713A (en) * 2016-11-03 2017-05-31 中山大学 It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis
CN106599072A (en) * 2016-11-21 2017-04-26 东软集团股份有限公司 Text clustering method and device
CN107092679A (en) * 2017-04-21 2017-08-25 北京邮电大学 A kind of feature term vector preparation method, file classification method and device
CN107357837A (en) * 2017-06-22 2017-11-17 华南师范大学 The electric business excavated based on order-preserving submatrix and Frequent episodes comments on sensibility classification method
CN110362815A (en) * 2018-04-11 2019-10-22 北京京东尚科信息技术有限公司 Text vector generation method and device
CN108875049A (en) * 2018-06-27 2018-11-23 中国建设银行股份有限公司 text clustering method and device
CN109101485A (en) * 2018-07-09 2018-12-28 重庆邂智科技有限公司 A kind of information processing method, device, electronic equipment and computer storage medium
CN109241529A (en) * 2018-08-29 2019-01-18 中国联合网络通信集团有限公司 The determination method and apparatus of viewpoint label
CN109902152A (en) * 2019-03-21 2019-06-18 北京百度网讯科技有限公司 Method and apparatus for retrieving information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李天彩; 席耀一; 王波; 张佳明: "An improved hierarchical clustering algorithm for short texts", Journal of Information Engineering University, no. 06, 15 December 2015 (2015-12-15) *
苏玉龙; 张著洪: "Research on text vectorization and classification algorithms based on keywords", Journal of Guizhou University (Natural Science Edition), no. 03, 15 June 2018 (2018-06-15) *

Similar Documents

Publication Publication Date Title
US11281860B2 (en) Method, apparatus and device for recognizing text type
CN112527649A (en) Test case generation method and device
CN107908616B (en) Method and device for predicting trend words
CN110147482B (en) Method and device for acquiring burst hotspot theme
CN109902152B (en) Method and apparatus for retrieving information
CN111190967A (en) User multi-dimensional data processing method and device and electronic equipment
CN114037059A (en) Pre-training model, model generation method, data processing method and data processing device
CN110807097A (en) Method and device for analyzing data
CN117312535A (en) Method, device, equipment and medium for processing problem data based on artificial intelligence
CN112100291A (en) Data binning method and device
CN112148841A (en) Object classification and classification model construction method and device
CN116308704A (en) Product recommendation method, device, electronic equipment, medium and computer program product
CN110852078A (en) Method and device for generating title
CN116048463A (en) Intelligent recommendation method and device for content of demand item based on label management
CN112825078A (en) Information processing method and device
CN110555204A (en) emotion judgment method and device
CN114528378A (en) Text classification method and device, electronic equipment and storage medium
CN113779239A (en) Hotspot information acquisition method and device
CN111274383B (en) Object classifying method and device applied to quotation
US11074591B2 (en) Recommendation system to support mapping between regulations and controls
CN113761228A (en) Label generating method and device based on multiple tasks, electronic equipment and medium
CN113612777A (en) Training method, traffic classification method, device, electronic device and storage medium
Deshpande et al. A survey on: classification of Twitter data using sentiment analysis
CN111858917A (en) Text classification method and device
CN113342969A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination