CN113535891A - Internet short text topic feature and emotional tendency analysis method, system and medium

Info

Publication number
CN113535891A
Authority
CN
China
Prior art keywords
word
characteristic
words
similarity
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110632146.8A
Other languages
Chinese (zh)
Inventor
郭浩哲
蒙圣光
廖玉敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Fastersoft Software Co ltd
Original Assignee
Guangdong Fastersoft Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Guangdong Fastersoft Software Co ltd
Priority to CN202110632146.8A
Publication of CN113535891A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system and a medium for analyzing the topic features and emotional tendency of Internet short texts. The method comprises the following steps: collecting Internet OTA (online travel agency) resource objects and their evaluation information; performing word segmentation and clustering on the OTA evaluation information to obtain topic features; extracting the high-frequency segmented words under each clustered topic feature dimension, calculating their emotional tendency and feature tendency, and classifying them into a feature word bank and an emotion word bank; screening out a feature-domain stop word bank; establishing a synonym forest; splitting the evaluation information into short sentences and performing word segmentation, synonym forest processing and stop word processing on them; calculating the emotion vector of each sentence and computing its emotional tendency through a support vector machine; determining the feature tendency of the segmented words and, from it, the feature topic of each short sentence; and outputting the feature topic and the comprehensive emotional tendency of the evaluation information. The invention can accurately analyze the topics of Internet reviews and the reputation level of an industry.

Description

Internet short text topic feature and emotional tendency analysis method, system and medium
Technical Field
The invention relates to the technical field of data processing, in particular to a method, a system and a medium for analyzing topic features and emotional tendency of Internet short texts.
Background
At present, emotion analysis of reviews mainly relies on the snowNLP library: the word segmentation dictionary and the emotion dictionary are iterated repeatedly, the segmentation results are compared against the positive and negative emotion dictionaries under snowNLP to obtain an emotion word list, and the occurrence frequencies of positive and negative emotion words are counted to evaluate the emotional tendency. However, existing emotional tendency analysis methods ignore emotion analysis along the feature dimensions that Internet reviews focus on, that is, the common scoring elements.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the prior art. To this end, the invention provides an Internet short text topic feature and emotional tendency analysis method that can accurately analyze Internet review topics and industry reputation levels.
The invention further provides an Internet short text topic feature and emotional tendency analysis system.
The invention also provides a computer readable storage medium for implementing the Internet short text topic feature and emotional tendency analysis method.
According to an embodiment of the first aspect of the invention, the method for analyzing the topic features and emotional tendency of Internet short texts comprises the following steps:
S100, collecting Internet OTA resource objects and their evaluation information through a Python web crawler, storing them in a database, and normalizing the resource objects of different platforms;
S200, performing word segmentation on the OTA evaluation information, clustering according to the similarity of the segmentation results to obtain the feature words of each class, and obtaining the topic features from those feature words;
S300, extracting the high-frequency segmented words under each clustered topic feature dimension, calculating the emotional tendency and feature tendency of the high-frequency words based on KNN, and classifying them into a feature word bank and an emotion word bank; recording the feature word bank as a domain keyword bank, and screening out a feature-domain stop word bank according to feature similarity; establishing a synonym forest based on the similarity between words;
S400, inputting complete OTA evaluation information, splitting it into short sentences, filtering out the short sentences that contain no domain feature keywords, and performing word segmentation, synonym forest processing and stop word processing on the short sentences that contain domain keywords;
S500, obtaining emotion word vectors from the word similarity and the emotion word bank, calculating the emotion vector of each sentence, and computing the emotional tendency through a support vector machine;
S600, obtaining the feature tendency of each segmented word from the word similarity and the feature word bank, and determining the feature topic of each short sentence by counting;
S700, outputting the feature topic and the comprehensive emotional tendency of the evaluation information.
The method for analyzing the topic features and emotional tendency of Internet short texts according to the embodiment of the invention has at least the following beneficial effects: it does not evaluate emotional tendency from positive and negative emotion words alone, but also takes the feature topic of the evaluation information into account. It can therefore identify the topic of an Internet review and quantify the emotion attached to that topic feature, mine user opinions deeply and accurately, analyze the reputation of an industry as it develops, and provide data support for scientific decision-making in that industry.
According to some embodiments of the invention, step S100 comprises: associating and matching the objects of each platform according to name similarity, address similarity and specific coordinates.
According to some embodiments of the invention, step S200 comprises: segmenting the OTA evaluation information with jieba, storing the segmented words in an associated word bank according to sentence association, and storing every two adjacent segmented words in the associated word bank as a new word; inputting the segmentation results into a word2vec model for training, sentence by sentence with the segmented words separated by spaces, to obtain a trained word similarity comparison model; and comparing the similarity of the segmentation results through word2vec, feeding them into a k-means model for classification according to word similarity, extracting the feature words of each class from the classification result, and combining them with industry standards to obtain the final topic features.
According to some embodiments of the invention, step S300 comprises: extracting the high-frequency segmented words under each topic feature dimension; dividing the emotional tendency into a plurality of levels, then calculating the emotion/feature tendency of the high-frequency words based on KNN and classifying them into feature word banks and emotion word banks; taking each feature word bank produced by KNN as a domain keyword bank and training with a word2vec model to form word similarity model vectors; placing the words whose first and second similarities differ by no more than a threshold in the feature-domain stop word bank; and calculating the similarity between words with the trained word2vec, regarding words whose similarity exceeds a set threshold as synonyms, and establishing a synonym forest.
According to some embodiments of the invention, step S500 comprises: obtaining the similarity array of the nearest neighbor words of a segmented word with word2vec, and comparing each nearest neighbor word array with the emotion word banks of the several levels; if a word bank contains a word that is identical or whose similarity exceeds a set threshold, the emotion level of the segmented word is taken to be the level of that emotion word bank; the emotion word vector is then obtained from the nearest neighbor words.
According to some embodiments of the invention, step S500 comprises: if the nearest neighbor words include a feature keyword or a word from its synonym forest, doubling the value of the emotion word vector.
According to some embodiments of the invention, calculating the emotion vector of the sentence comprises: linearly adding the emotion vectors of the segmented words to obtain the emotion vector of the sentence.
According to some embodiments of the invention, step S600 comprises: performing nearest neighbor matching (KNN) through word2vec between each segmented word and all words in the feature word banks, with a set similarity threshold; if the number of words exceeding the threshold does not exceed K, the segmented word is ignored; otherwise the segmented word is assigned the feature to which most of its nearest neighbor words belong; and counting the number of segmented words for each feature in the short sentence, the feature with the largest count being the feature topic of the short sentence.
According to some embodiments of the invention, the method further comprises: when evaluating the emotional tendency, quantifying the emotion score of each short sentence in combination with degree words, adding the user's original rating, assigning each a weight of 0.5, and calculating the comprehensive emotion score.
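The equal weighting described above can be sketched as a plain weighted sum. This is only an illustration: the function name `comprehensive_score` is hypothetical, and the sketch assumes that the model-derived short-sentence score and the user's original rating have already been brought onto the same numeric scale, which the text does not specify.

```python
def comprehensive_score(phrase_score, user_score, w_phrase=0.5, w_user=0.5):
    """Weighted sum of the model-derived score and the user's own rating.

    Assumption: both inputs are on the same scale (for example 1 to 5).
    The default weights of 0.5 each follow the embodiment described above.
    """
    return w_phrase * phrase_score + w_user * user_score


# Hypothetical values: model score 4.0, user rating 5.0.
example_score = comprehensive_score(4.0, 5.0)
```

With equal weights the result is simply the mean of the two scores; other weightings would be a departure from the described embodiment.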
The system for analyzing the topic features and emotional tendency of Internet short texts according to an embodiment of the second aspect of the invention is used for implementing the method according to any embodiment of the first aspect of the invention, and comprises: an information acquisition module, used for collecting Internet OTA resource objects and their evaluation information through a Python web crawler; a topic feature module, used for segmenting the OTA evaluation information, clustering according to the similarity of the segmentation results to obtain the feature words of each class, and obtaining the topic features from those feature words; a word bank establishing module, used for extracting the high-frequency segmented words under each clustered topic feature dimension, calculating the emotional tendency and feature tendency of the high-frequency words based on KNN, and classifying them into a feature word bank and an emotion word bank; recording the feature word bank as a domain keyword bank, and screening out a feature-domain stop word bank according to feature similarity; and establishing a synonym forest based on the similarity between words; an information input module, used for inputting complete OTA evaluation information, splitting it into short sentences, filtering out the short sentences that contain no domain feature keywords, and performing word segmentation, synonym forest processing and stop word processing on the short sentences that contain domain keywords; an emotional tendency module, used for obtaining emotion word vectors from the word similarity and the emotion word bank, calculating the emotion vector of each sentence, and then computing the emotional tendency through a support vector machine; a feature topic module, used for obtaining the feature tendency of each segmented word from the word similarity and the feature word bank, and determining the feature topic of each short sentence by counting; and an output module, used for outputting the feature topic and the comprehensive emotional tendency of the evaluation information.
The system for analyzing the topic features and emotional tendency of Internet short texts according to the embodiment of the invention has at least the following beneficial effects: it does not evaluate emotional tendency from positive and negative emotion words alone, but also takes the feature topic of the evaluation information into account. It can therefore identify the topic of an Internet review and quantify the emotion attached to that topic feature, mine user opinions deeply and accurately, analyze the reputation of an industry as it develops, and provide data support for scientific decision-making in that industry.
A computer-readable storage medium according to an embodiment of the third aspect of the invention has stored thereon a computer program which, when executed by a processor, implements the method according to any embodiment of the first aspect of the invention.
Because the computer-readable storage medium of the embodiment of the invention stores computer-executable instructions for executing the method for analyzing Internet short text topic features and emotional tendency according to any embodiment of the first aspect of the invention, it has all the advantages of the first aspect of the invention.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
FIG. 2 is a block diagram of the modules of the system of an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; terms such as "greater than", "less than" and "exceeding" are understood as excluding the stated number, while terms such as "above", "below" and "within" are understood as including it. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the technical features indicated, or their order.
Referring to fig. 1, the invention provides an internet short text topic feature and emotional tendency analysis method, which mainly comprises the following steps:
(1) Collect the resource objects of Internet OTA scenic spots/hotels and their associated evaluation information with a Python web crawler and store them in a database. Then associate and match the objects of each platform according to name similarity, address similarity and specific coordinates, normalizing the resource objects of different platforms as far as possible.
For example: the Guangzhou Changlong amusement park may appear on OTAs as "Guangzhou Changlong amusement park", "Guangzhou Changlong amusement park dodgem project", "Guangzhou Changlong amusement park ten-ring roller coaster project" and so on; because their names, addresses, introductions and coordinates are very similar, they are merged into the same resource;
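The matching in step (1) can be sketched roughly as follows. The text does not state which similarity measures or thresholds are used, so everything here is an assumption made for illustration: `difflib.SequenceMatcher` stands in for the name and address similarity, a haversine distance stands in for the coordinate comparison, and the thresholds (0.6, 0.6, 1 km), the addresses and the coordinates are hypothetical values.

```python
import math
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Ratio in [0, 1] of matching characters between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def distance_km(p, q):
    """Great-circle (haversine) distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def same_resource(a, b, name_thr=0.6, addr_thr=0.6, max_km=1.0):
    """Assumed rule: names and addresses are similar and the points are close."""
    return (text_similarity(a["name"], b["name"]) >= name_thr
            and text_similarity(a["address"], b["address"]) >= addr_thr
            and distance_km(a["coord"], b["coord"]) <= max_km)


def merge_resources(records):
    """Greedily group each record with the first group whose head it matches."""
    groups = []
    for rec in records:
        for group in groups:
            if same_resource(group[0], rec):
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups


# Hypothetical records; addresses and coordinates are made up for the sketch.
records = [
    {"name": "Guangzhou Changlong amusement park",
     "address": "Panyu District, Guangzhou", "coord": (23.0000, 113.3300)},
    {"name": "Guangzhou Changlong amusement park dodgem project",
     "address": "Panyu District, Guangzhou", "coord": (23.0010, 113.3310)},
    {"name": "Guangzhou Changlong amusement park ten-ring roller coaster project",
     "address": "Panyu District, Guangzhou", "coord": (23.0005, 113.3305)},
    {"name": "Shenzhen Bay Hotel",
     "address": "Nanshan District, Shenzhen", "coord": (22.5000, 113.9500)},
]
groups = merge_resources(records)
```

Comparing against the first record of each group keeps the sketch short; a production matcher would more likely compare against every member of the group or against a canonical record.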
(2) Segment the OTA reviews of scenic spots/hotels with jieba, store the segmented words in the database associated by sentence, and store every two adjacent segmented words in the database as a new word;
for example: for the review "service really good", the words "service", "really" and "good" and the adjacent pairs "service really" and "really good" are entered into the word bank, and the words of "service really good" are entered into the associated word bank as associated words;
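A minimal sketch of the adjacent-pair expansion in step (2) follows. It assumes the review has already been segmented (by jieba in the described method; here the token list is supplied by hand), and the function name and the `joiner` parameter are hypothetical. For Chinese text the pair would be concatenated without a separator, so the default joiner is empty.

```python
def expand_with_adjacent_pairs(tokens, joiner=""):
    """Return the tokens followed by every pair of adjacent tokens joined together."""
    pairs = [a + joiner + b for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + pairs


# English stand-in for the example above, joined with a space for readability.
entries = expand_with_adjacent_pairs(["service", "really", "good"], joiner=" ")
```

Here `entries` holds the three single words followed by "service really" and "really good", matching the example.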
(3) Train on the word segmentation results with word2vec to form a word similarity calculation model. The segmentation results are input into the word2vec model for training sentence by sentence, with the segmented words separated by spaces (for example, a sentence stating that something "feels very good" is input as its space-separated words); after training on hundreds of thousands of sentences, word2vec forms a good word similarity comparison model;
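The input format of step (3) and the kind of similarity a trained word-vector model yields can be sketched without any training library. The helper names are hypothetical, and the use of cosine similarity is an assumption: the text speaks only of "similarity", though cosine similarity is the usual measure for word2vec vectors.

```python
import math


def to_training_lines(segmented_sentences):
    """One line per sentence, segmented words separated by single spaces."""
    return [" ".join(tokens) for tokens in segmented_sentences]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


lines = to_training_lines([["feels", "very", "good"], ["service", "really", "good"]])
```

The later steps that compare a similarity against 70% or 15% would operate on values such as those returned by `cosine_similarity`.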
(4) Compare the similarity of the segmented OTA review words of scenic spots/hotels with word2vec, feed the words into a k-means model according to word similarity and classify them into 8 classes (scenic spots) or 6 classes (hotels), manually extract from each class the feature word that best fits it, and combine with industry standards to obtain the final topic features (the final features are: for scenic spots, informatization, items, facilities, landscape, transportation, price, service and tour guide; for hotels, transportation, location, facilities, price, environment and service);
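The clustering in step (4) can be illustrated with a small, dependency-free k-means. This is a sketch under stated assumptions, not the described implementation: the two-dimensional vectors are made-up stand-ins for word2vec vectors, the first k points are used as initial centers so that the run is deterministic, and k is 2 instead of the 8 or 6 classes used for scenic spots and hotels.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    centers = [list(p) for p in points[:k]]  # assumption: deterministic init
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels


# Hypothetical word vectors: price-like words lean on axis 0, service-like on axis 1.
words = ["price", "staff", "ticket", "waiter", "cheap", "attitude"]
vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8), (1.0, 0.0), (0.0, 1.0)]
labels = kmeans(vectors, k=2)

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
```

As in the described method, naming each cluster (for example "price" or "service") remains a manual step carried out after inspecting the grouped words.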
(5) Extract the top 5,000 high-frequency segmented words under each topic feature dimension obtained by the clustering analysis. Divide the emotional tendency into four levels: extremely positive, positive, negative and extremely negative; the feature classes follow the classification of step (4). Calculate the emotion/feature tendency of the high-frequency words based on KNN (for the four emotion levels, "very good", "good", "poor" and "very poor" are respectively chosen as the initial matching center words), and classify the words into feature word banks and emotion word banks. Take each feature word bank produced by KNN as a domain keyword bank, train with the word2vec model to form word similarity model vectors, and place the words whose first (highest) and second (second-highest) similarities differ by no more than 15% in the feature-domain stop word bank.
For example: if, for the word "good", the similarity to "environment" and the similarity to the next closest feature differ by no more than 15%, "good" is set as a feature-domain stop word;
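The stop-word screening in step (5) can be sketched as a gap test between a word's two best feature similarities. The reading adopted here is an assumption: "first" and "second" similarity are taken to be the highest and second-highest similarity of the word to the feature classes, and the 15% is taken as an absolute difference between similarity scores. The function name and the similarity values are hypothetical.

```python
def is_feature_stop_word(feature_similarities, max_gap=0.15):
    """True if the two highest feature similarities are within max_gap of each other.

    feature_similarities maps a feature name to the word's similarity to it.
    """
    ranked = sorted(feature_similarities.values(), reverse=True)
    if len(ranked) < 2:
        return False
    return ranked[0] - ranked[1] <= max_gap


# Hypothetical similarities: "good" is about equally close to two features,
# while "remote" is clearly tied to one.
good_is_stop = is_feature_stop_word({"environment": 0.62, "service": 0.58, "price": 0.30})
remote_is_stop = is_feature_stop_word({"location": 0.81, "service": 0.30, "price": 0.22})
```

A word that cannot discriminate between features, such as "good" in this toy data, is screened out, while a word strongly tied to one feature is kept.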
(6) Use the trained word2vec to judge the similarity between words; words whose similarity exceeds 70% are regarded as synonyms, and a synonym forest is established;
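One way to build the synonym groups of step (6) is to link every pair of words whose similarity exceeds the threshold and take the connected groups. The sketch below uses a small union-find for that. Treating synonymy as transitive (A~B and B~C puts A, B and C in one group) is an assumption of this sketch, and the similarity table is hypothetical; in the described method the similarities come from the trained word2vec model.

```python
def build_synonym_groups(words, similarity, threshold=0.7):
    """Group words connected by pairwise similarity above the threshold."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the group root, compressing the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())


# Hypothetical pairwise similarities; unlisted pairs default to 0.0.
toy_similarities = {
    frozenset(("waiter", "attendant")): 0.85,
    frozenset(("attendant", "staff")): 0.75,
    frozenset(("waiter", "staff")): 0.60,
}


def toy_similarity(a, b):
    return toy_similarities.get(frozenset((a, b)), 0.0)


synonym_groups = build_synonym_groups(
    ["waiter", "attendant", "staff", "remote"], toy_similarity)
```

In later processing, each word of a group could be replaced by one representative of that group before matching against the word banks.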
(7) Perform word segmentation, synonym forest processing and stop word processing on the short sentences that contain the keywords;
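The preprocessing of step (7), together with the splitting and filtering described for S400, can be sketched as below. Several parts are assumptions made for illustration: short sentences are split on common Chinese and Western punctuation, a short sentence "contains" a keyword if the keyword occurs in it as a substring, `str.split` stands in for jieba segmentation, and the synonym forest is represented as a plain mapping from a word to its representative. All names and data are hypothetical.

```python
import re


def split_short_sentences(review):
    """Split a review into short sentences on punctuation and line breaks."""
    parts = re.split(r"[,.;!?，。；！？、\n]+", review)
    return [p.strip() for p in parts if p.strip()]


def preprocess(review, segment, domain_keywords, synonyms, stop_words):
    """Return one cleaned token list per short sentence that mentions a keyword."""
    result = []
    for phrase in split_short_sentences(review):
        # Filter: drop short sentences without any domain feature keyword.
        if not any(keyword in phrase for keyword in domain_keywords):
            continue
        tokens = segment(phrase)                      # word segmentation
        tokens = [synonyms.get(t, t) for t in tokens]  # synonym forest processing
        tokens = [t for t in tokens if t not in stop_words]  # stop word processing
        result.append(tokens)
    return result


cleaned = preprocess(
    "waiter attitude very good, weather nice today, a bit remote",
    segment=str.split,
    domain_keywords={"waiter", "attitude", "remote"},
    synonyms={"waiter": "service"},
    stop_words={"a"},
)
```

The middle short sentence is dropped because it mentions no domain keyword; the other two are returned as cleaned token lists.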
(8) When judging the tendency of a segmented word, use word2vec to obtain the similarity array of its nearest neighbor words and compare each nearest neighbor word array with the four emotion-level word banks; if a word bank contains a word that is identical or whose similarity exceeds 70%, the segmented word is considered to belong to that emotion level. The emotion word vector is finally obtained from the nearest neighbor words, and if these include a feature keyword or a word from its synonym forest, the vector values are doubled. The emotion vector of a sentence is obtained by directly adding the emotion vectors of its segmented words linearly, and the emotional tendency is finally computed from this emotion vector by a support vector machine. Word vectors describe a word along more dimensions, and doubling the weight of feature keywords makes the feature tendency more discriminative;
for example: the comment "the waiter's attitude is very good, it is just a bit remote" is segmented into "service / attitude / very good / is / a bit / remote". The nearest neighbor words of "service" are [serve, reception, ...], and its word emotion vector is [extremely positive: 0, positive: 2, negative: 1, extremely negative: 0]; similarly, the emotion vector of "attitude" is [extremely positive: 0, positive: 2, negative: 1, extremely negative: 0]; of "very good", [extremely positive: 2, positive: 2, negative: 0, extremely negative: 0]; of "is", [extremely positive: 0, positive: 0, negative: 0, extremely negative: 0]; of "a bit", [extremely positive: 0, positive: 0, negative: 0, extremely negative: 0]; and of "remote", [extremely positive: 0, positive: 0, negative: 3, extremely negative: 0]. The emotion vector of the whole sentence is therefore [extremely positive: 2, positive: 6, negative: 5, extremely negative: 0], from which the support vector machine obtains a favorable (good review) emotional tendency;
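The linear addition in the example above can be checked with a few lines of code. The per-word vectors are copied from the example; the function names are hypothetical, and the final support vector machine classification is deliberately left out because the text gives no details of its training data or parameters.

```python
LEVELS = ("extremely positive", "positive", "negative", "extremely negative")


def double_if_feature_related(vector, is_feature_related):
    """Double every component when the word's neighbors include a feature keyword."""
    return [2 * x for x in vector] if is_feature_related else list(vector)


def sentence_emotion_vector(word_vectors):
    """Component-wise sum of the word emotion vectors of one sentence."""
    return [sum(column) for column in zip(*word_vectors)]


# Word emotion vectors as listed in the example, in the order of LEVELS.
word_vectors = {
    "service":   [0, 2, 1, 0],
    "attitude":  [0, 2, 1, 0],
    "very good": [2, 2, 0, 0],
    "is":        [0, 0, 0, 0],
    "a bit":     [0, 0, 0, 0],
    "remote":    [0, 0, 3, 0],
}
sentence_vector = sentence_emotion_vector(list(word_vectors.values()))
```

The sum reproduces the sentence vector given in the example, [2, 6, 5, 0]; that four-component vector is what would be passed to the classifier.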
(9) Judge the feature topic of each short sentence. Similarly, perform nearest neighbor matching on each segmented word to find its nearest neighbor word array, then compare each nearest neighbor word with each feature word bank through word2vec; if a word is identical or has a similarity exceeding 70%, the segmented word is considered to belong to that word bank, otherwise the segmented word has no feature tendency. Count the number of segmented words for each feature in the short sentence; the feature with the largest count is the feature topic of the short sentence;
for example: the comment "the waiter's attitude is very good, it is just a bit remote" is segmented into "service / attitude / very good / is / a bit / remote". Feature tendency example: "the waiter's attitude is very good" -> "service / attitude / very good". "service" is contained in the "service" feature word bank, so its feature is "service"; "attitude" is contained in the "service" feature word bank, so its feature is "service"; the nearest neighbor words of "very good" are [good, great, ...], whose similarity to every feature word bank is below 70%, so it has no feature. The "service" feature therefore has the most segmented words (2), and the short sentence is considered to belong to the "service" feature. "it is just a bit remote" -> "is / a bit / remote". The nearest neighbor words of "is" and "a bit" have a similarity below 70% to every feature word bank, while "remote" belongs to "location"; "location" therefore has the most segmented words (1), and the short sentence is considered to belong to the "location" feature. The final tally for the comment is [service: 1, location: 1], and the comment is considered to belong to the "service" feature (the short-sentence counts are tied here; "service" also has more matched words, 2 versus 1).
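The counting in step (9) can be sketched with `collections.Counter`. The per-word features below are taken from the example, with `None` marking a segmented word that has no feature tendency; the function name is hypothetical. How a tie between features is resolved is not spelled out in the text, so this sketch only tallies the short sentences and leaves the comment-level choice open.

```python
from collections import Counter


def phrase_topic(token_features):
    """Most frequent feature among a short sentence's words, or None if there is none.

    Ties fall to the feature encountered first, which is an arbitrary choice
    of this sketch.
    """
    counts = Counter(f for f in token_features if f is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Features per segmented word, as worked out in the example.
first_phrase = ["service", "service", None]   # service / attitude / very good
second_phrase = [None, None, "location"]      # is / a bit / remote

phrase_topics = [phrase_topic(first_phrase), phrase_topic(second_phrase)]
comment_tally = Counter(phrase_topics)
```

The tally for the example comment is one short sentence for "service" and one for "location", matching the counts given above.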
(10) Input a complete OTA review into the finally formed analysis model; the model automatically identifies the most similar feature topic for the review and gives its comprehensive emotional tendency (good/medium/bad review).
For example: for the comment "the waiter's attitude is very good, it is just a bit remote", the final model output is: review feature "service", review emotion "good review".
Corresponding to the foregoing embodiments, the present invention also provides system embodiments. For the system embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points.
Referring to FIG. 2, a system of an embodiment of the present invention includes: an information acquisition module, used for collecting Internet OTA resource objects and their evaluation information through a Python web crawler; a topic feature module, used for segmenting the OTA evaluation information, clustering according to the similarity of the segmentation results to obtain the feature words of each class, and obtaining the topic features from those feature words; a word bank establishing module, used for extracting the high-frequency segmented words under each clustered topic feature dimension, calculating the emotional tendency and feature tendency of the high-frequency words based on KNN, and classifying them into a feature word bank and an emotion word bank; recording the feature word bank as a domain keyword bank, and screening out a feature-domain stop word bank according to feature similarity; and establishing a synonym forest based on the similarity between words; an information input module, used for inputting complete OTA evaluation information, splitting it into short sentences, filtering out the short sentences that contain no domain feature keywords, and performing word segmentation, synonym forest processing and stop word processing on the short sentences that contain domain keywords; an emotional tendency module, used for obtaining emotion word vectors from the word similarity and the emotion word bank, calculating the emotion vector of each sentence, and then computing the emotional tendency through a support vector machine; a feature topic module, used for obtaining the feature tendency of each segmented word from the word similarity and the feature word bank, and determining the feature topic of each short sentence by counting; and an output module, used for outputting the feature topic and the comprehensive emotional tendency of the evaluation information.
Although specific embodiments have been described herein, those of ordinary skill in the art will recognize that many other modifications or alternative embodiments are equally within the scope of this disclosure. For example, any of the functions and/or processing capabilities described in connection with a particular device or component may be performed by any other device or component. In addition, while various illustrative implementations and architectures have been described in accordance with embodiments of the present disclosure, those of ordinary skill in the art will recognize that many other modifications of the illustrative implementations and architectures described herein are also within the scope of the present disclosure.
Certain aspects of the present disclosure are described above with reference to block diagrams and flowchart illustrations of systems, methods, and/or computer program products according to example embodiments. It will be understood that one or more blocks of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, respectively, can be implemented by executing computer-executable program instructions. Also, according to some embodiments, some blocks of the block diagrams and flow diagrams may not necessarily be performed in the order shown, or may not necessarily be performed in their entirety. In addition, additional components and/or operations beyond those shown in the block diagrams and flow diagrams may be present in certain embodiments.
Accordingly, blocks of the block diagrams and flowchart illustrations support combinations of means for performing the specified functions, combinations of elements or steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, can be implemented by special purpose hardware-based computer systems that perform the specified functions, elements or steps, or combinations of special purpose hardware and computer instructions.
Program modules, applications, etc. described herein may include one or more software components, including, for example, software objects, methods, data structures, etc. Each such software component may include computer-executable instructions that, in response to execution, cause at least a portion of the functionality described herein (e.g., one or more operations of the illustrative methods described herein) to be performed.
The software components may be encoded in any of a variety of programming languages. An illustrative programming language may be a low-level programming language, such as assembly language associated with a particular hardware architecture and/or operating system platform. Software components that include assembly language instructions may need to be converted by an assembler program into executable machine code prior to execution by a hardware architecture and/or platform. Another exemplary programming language may be a higher level programming language, which may be portable across a variety of architectures. Software components that include higher level programming languages may need to be converted to an intermediate representation by an interpreter or compiler before execution. Other examples of programming languages include, but are not limited to, a macro language, a shell or command language, a job control language, a scripting language, a database query or search language, or a report writing language. In one or more exemplary embodiments, a software component containing instructions of one of the above programming language examples may be executed directly by an operating system or other software component without first being converted to another form.
The software components may be stored as files or other data storage constructs. Software components of similar types or related functionality may be stored together, such as in a particular directory, folder, or library. Software components may be static (e.g., preset or fixed) or dynamic (e.g., created or modified at execution time).
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (10)

1. A method for analyzing topic features and emotional tendencies of Internet short texts, characterized by comprising the following steps:
S100, collecting Internet OTA resource objects and their evaluation information through a Python web crawler, storing the resource objects in a database, and normalizing the resource objects from different platforms;
S200, segmenting the OTA evaluation information into words, clustering the segmentation results by similarity to obtain the feature words of each class, and deriving topic features from those feature words;
S300, extracting high-frequency words from the segmented words under each topic feature dimension found by the cluster analysis, calculating the emotional tendency and the feature tendency of the high-frequency words based on KNN, and sorting them into a feature word bank and an emotion word bank; recording the feature word bank as a domain keyword bank, and screening out a feature-domain stop word bank according to feature similarity; and building a synonym forest based on the similarity between words;
S400, inputting complete OTA evaluation information, splitting it into short sentences, filtering out the short sentences that contain no domain feature keyword, and performing word segmentation, synonym forest processing and stop word processing on the short sentences that contain domain keywords;
S500, obtaining emotion word vectors from the word similarity and the emotion word bank, calculating the emotion vector of each sentence, and calculating the emotional tendency with a support vector machine;
S600, obtaining the feature tendency of each segmented word from the word similarity and the feature word bank, and determining the feature topic of each short sentence by counting;
S700, outputting the feature topic and the overall emotional tendency of the evaluation information.
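The input handling of step S400 can be illustrated with a minimal sketch. The keyword set, synonym map and stop word set below are hypothetical stand-ins for the domain keyword bank, synonym forest and stop word bank produced in S300, and whitespace splitting stands in for a real word segmenter; none of these names come from the claims.

```python
import re

# Hypothetical stand-ins for the resources built in step S300.
DOMAIN_KEYWORDS = {"room", "breakfast", "staff"}
SYNONYMS = {"rooms": "room", "personnel": "staff"}  # synonym -> canonical word
STOP_WORDS = {"the", "a", "and", "was", "were", "is"}


def split_short_sentences(review):
    """Split a complete review into short sentences on common punctuation."""
    parts = re.split(r"[,.;!?，。；！？]+", review)
    return [p.strip() for p in parts if p.strip()]


def preprocess(review):
    """Return the cleaned token list of each short sentence that mentions a domain keyword."""
    result = []
    for sentence in split_short_sentences(review):
        lowered = sentence.lower()
        # Filter first: drop short sentences without any domain keyword (substring check).
        if not any(keyword in lowered for keyword in DOMAIN_KEYWORDS):
            continue
        # Whitespace split is an assumption standing in for word segmentation.
        tokens = [SYNONYMS.get(t, t) for t in lowered.split()]
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result
```

Filtering before segmentation follows the order given in S400; a short sentence that mentions a domain keyword only through a synonym would be dropped by this sketch, which a real implementation might handle differently.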
2. The method for analyzing topic features and emotional tendencies of Internet short texts according to claim 1, wherein the step S100 comprises: associating and matching the objects of the different platforms according to name similarity, address similarity and specific coordinates.
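One way to picture the cross-platform matching of claim 2 is the sketch below. The claim does not say how the three signals are measured or combined, so the `difflib` ratio, the haversine distance, the record field names and all thresholds are assumptions made purely for illustration.

```python
import math
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Character-level similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in metres."""
    radius = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


def same_object(a, b, name_min=0.6, addr_min=0.6, max_dist_m=200.0):
    """Hypothetical rule: name, address and coordinates must all agree."""
    return (
        text_similarity(a["name"], b["name"]) >= name_min
        and text_similarity(a["address"], b["address"]) >= addr_min
        and distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_dist_m
    )
```

Requiring all three signals to agree is a conservative choice; a weighted score would be an equally plausible reading of the claim.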
3. The method for analyzing topic features and emotional tendencies of Internet short texts according to claim 1, wherein the step S200 comprises:
segmenting the OTA evaluation information into words with the jieba word segmenter, storing the segmented words in an associated-segmentation library according to their sentence association, and storing each pair of associated segmented words in the associated-segmentation library as a new word;
feeding the segmentation results into a word2vec model for training, sentence by sentence, with the segmented words separated by spaces, to obtain a trained word similarity comparison model;
comparing the similarity of the segmentation results through word2vec, feeding the segmentation results into a k-means model for classification according to word similarity, extracting the feature words of each class from the classification results, and combining them with industry standards to obtain the final topic features.
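The clustering part of claim 3 can be sketched with a plain k-means (Lloyd's algorithm) over word vectors. The sketch assumes the vectors already exist; in practice they would come from a word2vec model trained on the jieba output, which is not shown. Euclidean distance and the deterministic farthest-point initialisation are illustrative choices, not details taken from the claim.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """Cluster words by their vectors; returns k lists of words."""
    words = list(vectors)
    # Farthest-point initialisation: deterministic, chosen here for reproducibility.
    centroids = [list(vectors[words[0]])]
    while len(centroids) < k:
        far = max(words, key=lambda w: min(sq_dist(vectors[w], c) for c in centroids))
        centroids.append(list(vectors[far]))
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each word goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for w in words:
            best = min(range(k), key=lambda i: sq_dist(vectors[w], centroids[i]))
            clusters[best].append(w)
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if not members:
                updated.append(centroids[i])  # keep the old centroid for an empty cluster
                continue
            dim = len(centroids[i])
            updated.append([sum(vectors[w][d] for w in members) / len(members) for d in range(dim)])
        if updated == centroids:
            break  # converged
        centroids = updated
    return clusters


# Toy two-dimensional vectors standing in for word2vec embeddings.
TOY_VECTORS = {
    "room": [0.9, 0.1], "bed": [0.85, 0.15], "bathroom": [0.8, 0.2],
    "breakfast": [0.1, 0.9], "buffet": [0.15, 0.85], "coffee": [0.2, 0.8],
}
```

Calling `kmeans(TOY_VECTORS, 2)` separates the room-related words from the food-related words; the feature words of each class would then be reviewed against the industry standard mentioned in the claim.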
4. The method for analyzing topic features and emotional tendencies of Internet short texts according to claim 1, wherein the step S300 comprises:
extracting high-frequency words from the segmented words under each topic feature dimension, dividing the emotional tendency into a plurality of levels, calculating the emotional tendency or feature tendency of the high-frequency words based on KNN, and sorting them into a feature word bank and an emotion word bank; taking each feature word bank obtained by KNN as a domain keyword bank and training it with a word2vec model to form word similarity model vectors; and placing the words whose first similarity and second similarity differ by no more than a threshold into a feature-domain stop word bank;
calculating the similarity between words based on the trained word2vec model, treating words whose similarity exceeds a set threshold as synonyms, and building the synonym forest from them.
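The two threshold rules of claim 4 can be sketched as follows. The sketch assumes that the "first" and "second" similarity are a word's two highest similarities to the feature word banks, and that each bank is summarised by a single centroid vector; both readings are assumptions, as are the function names and threshold values. Synonym groups are formed with a small union-find structure.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def feature_stop_words(word_vecs, bank_centroids, max_gap=0.05):
    """Words whose two highest bank similarities differ by at most max_gap."""
    stop = set()
    for word, vec in word_vecs.items():
        sims = sorted((cosine(vec, c) for c in bank_centroids.values()), reverse=True)
        if len(sims) >= 2 and sims[0] - sims[1] <= max_gap:
            stop.add(word)  # the word does not discriminate between features
    return stop


def synonym_forest(word_vecs, min_sim=0.95):
    """Group words whose pairwise similarity exceeds min_sim."""
    parent = {w: w for w in word_vecs}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(word_vecs, 2):
        if cosine(word_vecs[a], word_vecs[b]) > min_sim:
            parent[find(a)] = find(b)  # merge the two synonym groups
    groups = {}
    for w in word_vecs:
        groups.setdefault(find(w), set()).add(w)
    return [g for g in groups.values() if len(g) > 1]
```

With toy vectors, a generic word such as "hotel" that is equally similar to every feature ends up in the stop word bank, while near-identical words such as "bed" and "mattress" fall into one synonym group.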
5. The method for analyzing topic features and emotional tendencies of Internet short texts according to claim 1, wherein the step S500 comprises: obtaining a similarity array of the nearest-neighbor words of each segmented word using word2vec, comparing each nearest-neighbor array with the emotion word banks of the plurality of levels, and, if a word is identical or has a similarity exceeding a set threshold, taking the emotion level of the segmented word to be the emotion level of the corresponding emotion word bank, and obtaining the emotion word vector from the nearest-neighbor words.
6. The method for analyzing topic features and emotional tendencies of Internet short texts according to claim 5, wherein said step S500 comprises: if the nearest-neighbor words include a feature keyword or a word from its synonym forest, doubling the value of the emotion word vector.
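Claims 5 and 6 leave the exact shape of the emotion word vector open. The sketch below assumes one component per emotion level, a hypothetical five-level scale, and neighbor lists given as `(word, similarity)` pairs such as a word2vec nearest-neighbor query would return; all names and thresholds are illustrative.

```python
LEVELS = (-2, -1, 0, 1, 2)  # hypothetical five-level emotion scale


def emotion_word_vector(word, neighbors, emotion_banks, feature_words, min_sim=0.8):
    """Return a per-level vector for one segmented word.

    emotion_banks maps a level to a set of words; feature_words holds the
    feature keywords together with their synonyms.
    """
    vec = [0.0] * len(LEVELS)
    # Exact match: the word itself is in one of the emotion word banks.
    level = next((lv for lv in LEVELS if word in emotion_banks.get(lv, ())), None)
    if level is None:
        # Otherwise use the most similar neighbor above the threshold that is in a bank.
        for nb, sim in sorted(neighbors, key=lambda p: p[1], reverse=True):
            if sim < min_sim:
                break
            level = next((lv for lv in LEVELS if nb in emotion_banks.get(lv, ())), None)
            if level is not None:
                break
    if level is None:
        return vec  # no emotional tendency found
    # Claim 6: double the value when a feature keyword (or synonym) is among the neighbors.
    weight = 2.0 if any(nb in feature_words for nb, _ in neighbors) else 1.0
    vec[LEVELS.index(level)] = weight
    return vec
```

The doubling rewards emotion words that appear in the neighborhood of a feature keyword, which is how the sketch reads claim 6.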
7. The method of claim 1, wherein computing the emotion vector of the sentence comprises: linearly adding the emotion vectors of the segmented words to obtain the emotion vector of the sentence.
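The linear addition of claim 7 amounts to a component-wise sum. A minimal sketch, assuming every word vector has the same length (the function name is hypothetical):

```python
def sentence_emotion_vector(word_vectors):
    """Component-wise sum of the emotion vectors of a sentence's segmented words."""
    if not word_vectors:
        return []
    return [sum(components) for components in zip(*word_vectors)]
```

The resulting vector would then serve as the input to the support vector machine of step S500, for instance a classifier such as scikit-learn's `SVC`; that training step is outside this sketch.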
8. The method for analyzing topic features and emotional tendencies of Internet short texts according to claim 1, wherein the step S600 comprises:
matching each segmented word against all words in the feature word bank by nearest-neighbor search through word2vec with a set threshold; if the number of words exceeding the threshold does not exceed K, ignoring the segmented word; otherwise assigning the segmented word to the feature to which most of its nearest-neighbor words belong;
counting the number of segmented words of each feature in the short sentence, the feature with the largest count being the feature topic of the short sentence.
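The voting described in claim 8 can be sketched as below. Similarities are passed in as precomputed dictionaries, standing in for word2vec lookups against the feature word bank; the function names, the threshold and the value of K are hypothetical.

```python
from collections import Counter


def word_feature(neighbor_sims, word_to_feature, min_sim=0.7, k=2):
    """Feature of one segmented word, or None if too few bank words are close.

    neighbor_sims maps each feature-bank word to its similarity with the
    segmented word; word_to_feature maps a bank word to its feature.
    """
    close = [w for w, s in neighbor_sims.items() if s > min_sim]
    if len(close) <= k:
        return None  # not more than K close neighbors: ignore this word
    votes = Counter(word_to_feature[w] for w in close)
    return votes.most_common(1)[0][0]


def sentence_topic(token_sims, word_to_feature, min_sim=0.7, k=2):
    """Feature topic of a short sentence: the feature with the most segmented words."""
    counts = Counter()
    for sims in token_sims:
        feature = word_feature(sims, word_to_feature, min_sim, k)
        if feature is not None:
            counts[feature] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Ties are resolved by `Counter.most_common`, which keeps first-seen order among equal counts; the claim does not specify a tie-breaking rule.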
9. A system for analyzing topic features and emotional tendencies of Internet short texts, for implementing the method as claimed in any one of claims 1 to 8, comprising:
an information acquisition module for collecting Internet OTA resource objects and their evaluation information through a Python web crawler;
a topic feature module for segmenting the OTA evaluation information into words, clustering the segmentation results by similarity to obtain the feature words of each class, and deriving topic features from those feature words;
a word bank building module for extracting high-frequency words from the segmented words under each topic feature dimension found by the cluster analysis, calculating the emotional tendency and the feature tendency of the high-frequency words based on KNN, and sorting them into a feature word bank and an emotion word bank; recording the feature word bank as a domain keyword bank, and screening out a feature-domain stop word bank according to feature similarity; and building a synonym forest based on the similarity between words;
an information input module for inputting complete OTA evaluation information, splitting it into short sentences, filtering out the short sentences that contain no domain feature keyword, and performing word segmentation, synonym forest processing and stop word processing on the short sentences that contain domain keywords;
an emotional tendency module for obtaining emotion word vectors from the word similarity and the emotion word bank, calculating the emotion vector of each sentence, and then calculating the emotional tendency with a support vector machine;
a feature topic module for obtaining the feature tendency of each segmented word from the word similarity and the feature word bank, and determining the feature topic of each short sentence by counting;
and an output module for outputting the feature topic and the overall emotional tendency of the evaluation information.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the method of any one of claims 1 to 8.
CN202110632146.8A 2021-06-07 2021-06-07 Internet short text topic feature and emotional tendency analysis method, system and medium Pending CN113535891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110632146.8A CN113535891A (en) 2021-06-07 2021-06-07 Internet short text topic feature and emotional tendency analysis method, system and medium

Publications (1)

Publication Number Publication Date
CN113535891A true CN113535891A (en) 2021-10-22

Family

ID=78124610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110632146.8A Pending CN113535891A (en) 2021-06-07 2021-06-07 Internet short text topic feature and emotional tendency analysis method, system and medium

Country Status (1)

Country Link
CN (1) CN113535891A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727487A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Network criticism oriented viewpoint subject identifying method and system
CN104268197A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Industry comment data fine grain sentiment analysis method
US20170109633A1 (en) * 2015-10-15 2017-04-20 Sap Se Comment-comment and comment-document analysis of documents
CN107239439A (en) * 2017-04-19 2017-10-10 同济大学 Public sentiment sentiment classification method based on word2vec
CN109977413A (en) * 2019-03-29 2019-07-05 南京邮电大学 A kind of sentiment analysis method based on improvement CNN-LDA
CN110442728A (en) * 2019-06-28 2019-11-12 天津大学 Sentiment dictionary construction method based on word2vec automobile product field
CN111078894A (en) * 2019-12-17 2020-04-28 中国科学院遥感与数字地球研究所 Scenic spot evaluation knowledge base construction method based on metaphor topic mining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
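黎巎, 谢宗彦, 张公鹏, 郝志成, 向征: "LDA-based topic classification of tourists' online reviews: the Palace Museum as an example" (original title: 基于LDA的游客网络评论主题分类:以故宫为例), TECHNOLOGY INTELLIGENCE ENGINEERING, vol. 3, no. 3, 31 December 2017 (2017-12-31), pages 55 - 63 *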

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination