CN106709824B

CN106709824B - Building evaluation method based on semantic analysis of web text

Info

Publication number: CN106709824B
Application number: CN201611159450.0A
Authority: CN
Inventors: 赵渺希; 郭振松; 梁景宇
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2016-12-15
Filing date: 2016-12-15
Publication date: 2020-07-28
Anticipated expiration: 2036-12-15
Also published as: CN106709824A

Abstract

The invention discloses a building evaluation method based on semantic analysis of web text. The method comprises: selecting a professional building forum, obtaining web texts with LocoySpider software, and screening and sorting them; performing semantic analysis on the web texts with the Jieba word segmentation tool and a Chinese word frequency analysis tool, and performing screening, matching and a nonparametric test against the segmented word-class frequency table of the Modern Chinese Corpus to establish a network building professional corpus; and analyzing the characteristic words of an individual building case, comparing them with the network building professional corpus, and analyzing the difference in attention paid to the individual building case by the online public and by professional building designers.

Description

Building evaluation method based on semantic analysis of web text

Technical Field

The invention relates to a building evaluation method, in particular to a building evaluation method based on web text semantic analysis, and belongs to the field of building evaluation.

Background

With the advent of the information age and the network society, the media through which architecture is communicated have become increasingly varied. Besides traditional print media such as newspapers and magazines, the rise of new media such as social software, professional building forums and Tieba (post bar) communities provides new platforms and tools for building commentary. In recent years, many buildings with nicknames such as 'autumn pants', 'big underpants' and 'big intestine tower' have drawn attention online, attracting wide interest from netizens and the general public and setting off a round of architectural criticism, which has had a broad influence on building design and building commentary. Diversified media for communicating architecture play an increasingly important role in the field of building commentary and profoundly influence its subjects, content, form and value standards [1]. Given the role of new network media in the building field today, how different groups such as designers and the general public differ in their perception of buildings, and how the network media tools of the new era can be used to effectively promote public participation in building design, are subjects worthy of in-depth research.

With the continuous improvement of information technology, methods for word frequency analysis, semantic analysis and comment tendency analysis are maturing. Zhangming et al. (2009) invented a Chinese web page classification method based on keyword frequency analysis, which filters noise with a regular-expression filter and then uses a word segmenter and a keyword frequency analyzer to perform a fuzzy classification calculation that yields the category of the web page [1]; Wangyi et al. (2013) invented a semantic analysis method and system, which segments the corpus and performs iterative sampling along the document and word dimensions, and carries out semantic analysis on the resulting converged sampling model [2]; Shiliu (2014) invented a method and device for extracting domain keywords, in which keywords of the domain are extracted by a set algorithm after a word frequency matrix is generated [3]; Zhao Juxi et al. (2016) invented a method for generating an urban cognition map based on internet word frequency, which maps urban cognition measures collected from network data onto a city map [4]; Wu Qiong et al. (2009) invented a cross-domain text sentiment orientation analysis method, which builds matrix relations from text sets, computes sentiment scores from the matrices and normalizes them [5]; Zhongdingfu (Beijing) Technology Development Co., Ltd. (2011) invented a system and method for tendency analysis of short texts, which identifies the semantic structure of a sentence, searches the sentence for predefined tendency words and tendency patterns, and analyzes its tendency [6]; Wumingfen et al. (2013) invented an automatic classification system for tendency texts and an implementation method thereof, which classifies texts based on a library of sentiment classification syntax trees and a library of dependency relationship graphs [7]; Donelili et al. (2013) invented a text tendency analysis method and a commodity comment tendency discriminator based on it, which determines text tendency through dependency grammar analysis and a sentiment dictionary calculation engine [8]; and a further invention discloses a method and device for determining text tendency, which determines the tendency of a sentence containing an industry characteristic word according to a preset dictionary of industry characteristic words and a text classification model [9].

Therefore, web text can be used to establish a professional building corpus and to study the public's preferences among different building schemes, so that more of the language of building commentary is reflected in building design, promoting the development of building evaluation and building design.

The references mentioned above are as follows:

[1] Zhangming, ridge dragon, Luyanhong, Von source, Yanrui, Wang Pan. Patent application publication No. CN101593200, 2009-12-02.

[2] Wangyi, zhao schanmin, sun jonglong, rigor, wangli peak, junk, royal bin. Semantic analysis method and system [P]. Guangdong: patent application publication No. CN104346339A, 2015-02-11.

[3] Shiliu. A method and device for extracting domain keywords [P]. Beijing: patent application publication No. CN103870575A, 2014-06-18.

[4] Zhao vast xi, Huangjunhao, Linyan willow, zhongguang. An urban cognitive map generation method based on internet word frequency [P]. Patent application publication No. CN105574259A, 2016-05-11.

[5] Wu Qiong, Tan Tubo, section persistence, Cheng Zhi. A cross-domain text emotional orientation analysis method [P]. Beijing: patent application publication No. CN101714135A, 2010-05-26.

[6] System and method for tendency analysis of short text [P]. Beijing: patent application publication No. CN102541840A, 2012-07-04.

[7] Wumingfen, Chentao, Liuxing forest. An automatic classification system for tendency texts and an implementation method [P]. Guangdong: patent application publication No. CN102930042A, 2013-02-13.

[8] Dongli, Zhao flourishing, Zhang Xiang, Wang Ru. A text tendency analysis method and a commodity comment tendency discriminator based on the method [P]. Shaanxi: patent application publication No. CN103455562A, 2013-12-18.

[9] Method and apparatus for determining text orientation [P]. Beijing: patent application publication No. CN104572616A, 2015-04-29.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a building evaluation method based on semantic analysis of web texts.

The purpose of the invention can be achieved by adopting the following technical scheme:

A building evaluation method based on semantic analysis of web text comprises the following steps:

S1, selecting a professional building forum, acquiring web text with LocoySpider software, and screening and sorting the web text;

S2, performing semantic analysis on the web text with a Chinese word segmentation tool and a Chinese word frequency analysis tool, and performing screening, matching and a nonparametric test against the segmented word-class frequency table of the Modern Chinese Corpus to establish a network building professional corpus;

S3, analyzing the characteristic words of an individual building case, comparing them with the network building professional corpus, and analyzing the difference in attention paid to the individual building case by the online public and by professional building designers.

Preferably, in step S1, selecting a professional building forum, acquiring web text with LocoySpider software, and screening and sorting the web text specifically includes:

S11, selecting a professional building forum with a sufficient number of comment samples as the data source;

S12, creating and editing a new LocoySpider task, analyzing the source code of the web page structure of the professional building forum, and selecting the strings immediately before and after each required field as the identification strings for capturing the required web page information, where the main tag information obtained by crawling comprises the forum topic, the commenter's user name, the comment time and the comment content;

S13, configuring the content collection rules of the LocoySpider task, and running the task to crawl the relevant data;

S14, refining and sorting the acquired comment data by the tags forum topic, commenter's user name, comment time and comment content, and removing forum bulletins and advertisement posts.
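Steps S12 to S14 can be illustrated outside LocoySpider with a minimal Python sketch. It is not the LocoySpider tool itself: the marker strings in `FIELDS`, the noise keywords in `NOISE_MARKERS` and all function names are hypothetical, chosen only to show extraction between a "before" and an "after" identification string followed by removal of bulletins and advertisements.

```python
# Hypothetical before/after identification strings for each crawled field.
# A real forum page would need markers taken from its own source code.
FIELDS = {
    "topic": ('<h1 class="topic">', "</h1>"),
    "user": ('<span class="user">', "</span>"),
    "time": ('<span class="time">', "</span>"),
    "content": ('<div class="content">', "</div>"),
}

# Assumed keywords marking bulletins (公告) and advertisements (广告).
NOISE_MARKERS = ("公告", "广告")


def extract_between(html, before, after):
    """Return the text between the first `before` marker and the next
    `after` marker, or an empty string if either marker is missing."""
    start = html.find(before)
    if start == -1:
        return ""
    start += len(before)
    end = html.find(after, start)
    if end == -1:
        return ""
    return html[start:end].strip()


def parse_post(html, fields=FIELDS):
    """Build one record (topic, user, time, content) from a page."""
    return {name: extract_between(html, before, after)
            for name, (before, after) in fields.items()}


def clean(records):
    """Drop records without comment content and records whose topic
    looks like a bulletin or an advertisement."""
    return [r for r in records
            if r["content"]
            and not any(m in r["topic"] for m in NOISE_MARKERS)]
```

In practice the noise filter would probably need forum-specific rules; keyword matching on the topic is only the simplest possible stand-in.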

Preferably, in step S2, performing semantic analysis on the web text with the Jieba word segmentation tool and a Chinese word frequency analysis tool, and performing screening, matching and a nonparametric test against the segmented word-class frequency table of the Modern Chinese Corpus to establish the network building professional corpus specifically includes:

S21, converting the screened and sorted professional building forum comment data into txt text format, and segmenting it with the Jieba word segmentation tool to form a vocabulary list of the professional building forum comments;

S22, based on the vocabulary list formed in step S21, using a Chinese word frequency statistics tool to count, for the professional building forum comment data, the frequency, repetition count, percentage and de-duplicated percentage of each word;

S23, according to the word frequency table of the Modern Chinese Corpus on the Corpus Online website, matching a certain number of word samples and obtaining their word frequencies both in the professional building forum and in the Modern Chinese Corpus as a whole;

S24, performing standard normalization on the two groups of word frequency data;

S25, importing the normalized data into SPSS software, performing a nonparametric test on the two groups of word frequencies with the two-paired-samples nonparametric test command, and judging whether the overall distributions of the two paired samples differ significantly;

S26, when the overall distributions of the two paired samples differ significantly, analyzing the importance of the professional building forum words based on the TextRank algorithm;

S27, sorting the professional building forum words from high to low importance according to the word importance data formed in step S26, screening out the high-frequency words of the Modern Chinese Corpus according to the word frequency table of the Modern Chinese Corpus on the Corpus Online website, and taking the remaining words as the network building professional vocabulary;

S28, classifying and sorting the network building professional vocabulary formed in step S27 by building type, building function, building shape, traffic layout, building environment, building color, building materials and structure, spatial layout, building results, building components and building roles, and establishing the network building professional corpus.
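The counting, filtering and classification in steps S22, S27 and S28 can be sketched in a few lines of Python. This is an illustration under assumptions, not the tools named in the method: the input is taken to be an already segmented token list (for example the output of a word segmenter), and the general high-frequency word set and the category assignments are hypothetical inputs that would come from the Modern Chinese Corpus frequency table and from manual sorting.

```python
from collections import Counter


def frequency_table(tokens):
    """Return (word, count, share of all tokens) rows, most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, n, n / total) for word, n in counts.most_common()]


def remove_general_words(ranked_words, general_high_freq):
    """Keep the ranking order but drop words that are also high-frequency
    words of the general-language corpus (step S27)."""
    return [w for w in ranked_words if w not in general_high_freq]


def classify(words, categories):
    """Group words under category names (step S28).
    `categories` maps a category name to the set of words assigned to it;
    the assignment itself is assumed to be prepared by hand."""
    return {name: [w for w in words if w in members]
            for name, members in categories.items()}


# Usage with a toy token list: 的 is a general function word,
# 立面 (facade) and 屋顶 (roof) are building terms.
tokens = ["立面", "的", "屋顶", "的", "立面", "的"]
table = frequency_table(tokens)
ranked = [word for word, _, _ in table]
professional = remove_general_words(ranked, general_high_freq={"的"})
corpus = classify(professional, {"building shape": {"立面", "屋顶"}})
```

Here the frequency ranking stands in for the TextRank importance ranking of step S26; the filtering and grouping logic is the same for either ordering.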

Preferably, in step S3, analyzing the characteristic words of the individual building case, comparing them with the network building professional corpus, and analyzing the difference in attention paid to the individual building case by the online public and by professional building designers specifically includes:

S31, converting the screened and sorted comment data of the individual building case into txt text format, and segmenting it with a Chinese word segmentation tool to form a vocabulary list of the comments on the individual building case;

S32, based on the vocabulary list formed in step S31, using a Chinese word frequency statistics tool to count, for the comment data of the individual building case, the frequency, repetition count, percentage and de-duplicated percentage of each word;

S33, according to the word frequency table of the Modern Chinese Corpus on the Corpus Online website, matching a certain number of word samples and obtaining their word frequencies both in the comments on the individual building case and in the Modern Chinese Corpus as a whole;

S34, performing standard normalization on the two groups of word frequency data;

S35, importing the normalized data into SPSS software, performing a nonparametric test on the two groups of word frequencies with the two-paired-samples nonparametric test command, and judging whether the overall distributions of the two paired samples differ significantly;

S36, when the overall distributions of the two paired samples differ significantly, analyzing the importance of the building scheme words based on the TextRank algorithm;

S37, sorting the words of the individual building case from high to low importance according to the word importance data formed in step S36, screening out the high-frequency words of the Modern Chinese Corpus that appear among them according to the word frequency table of the Modern Chinese Corpus on the Corpus Online website, and taking the remaining words as the characteristic words of the individual building case;

S38, comparing the characteristic words of the individual building case formed in step S37 with the network building professional corpus, and analyzing the difference in attention paid to the individual building case by the online public and by professional building designers.
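The comparison in step S38 is described only qualitatively. One simple way to make it concrete, offered here as an assumption rather than as the method's own procedure, is a set comparison: characteristic words that also occur in the network building professional corpus are treated as shared points of attention, and those that do not are treated as specific to the public comments on the case. The function name and the returned keys are hypothetical.

```python
def compare_focus(case_feature_words, professional_corpus_words):
    """Split the characteristic words of a building case into words shared
    with the professional corpus and words found only in the case comments."""
    case = set(case_feature_words)
    professional = set(professional_corpus_words)
    shared = case & professional
    return {
        "shared": shared,                      # attention common to both groups
        "case_only": case - professional,      # attention specific to the case comments
        "overlap_ratio": len(shared) / len(case) if case else 0.0,
    }


# Toy usage with English stand-ins for segmented Chinese words.
result = compare_focus(
    ["roof", "curve", "nickname", "cost"],
    ["roof", "curve", "facade", "structure"],
)
```

A low overlap ratio would suggest that the comments on the case concentrate on elements that the professional vocabulary does not cover; how to interpret a given ratio is left open.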

Preferably, performing standard normalization on the two groups of word frequency data specifically includes:

Suppose the $i$-th word frequency of the $j$-th group's vocabulary list is $\alpha_{ij}$, and the standard value obtained after standard normalization is $\theta_{ij}$. The specific formula is as follows:

where $i = 1, 2, \ldots, x$; $j = 1, 2$.
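The normalization formula itself is not reproduced in this text. The sketch below therefore assumes z-score standardization within each group, which is one common reading of "standard normalization"; the method may use a different formula, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def standardize(frequencies):
    """ASSUMPTION: z-score standardization of one group of word frequencies,
    theta_i = (alpha_i - group mean) / group population standard deviation.
    The source does not state its formula; this is an illustrative stand-in."""
    mu = mean(frequencies)
    sigma = pstdev(frequencies)
    if sigma == 0:
        # All frequencies identical: every standardized value is zero.
        return [0.0 for _ in frequencies]
    return [(alpha - mu) / sigma for alpha in frequencies]


# The same words, counted in two corpora of very different size,
# become comparable after each group is standardized separately.
forum_freq = [120, 45, 30]
corpus_freq = [9000, 20000, 400]
theta_forum = standardize(forum_freq)
theta_corpus = standardize(corpus_freq)
```

Standardizing each group separately is what makes the paired comparison in the next step meaningful when the two corpora differ greatly in size.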

Preferably, performing the nonparametric test on the two groups of word frequencies with the two-paired-samples nonparametric test command, and judging whether the overall distributions of the two paired samples differ significantly, is specifically:

Following the sign test method, the observed values $\beta_{ij}$ of the first group of samples are subtracted from the corresponding observed values of the second group of samples. If the difference is positive, it is recorded with a positive sign; if the difference is negative, it is recorded with a negative sign; when the difference equals 0, the corresponding case is deleted and the number of samples $x$ is reduced accordingly;

The difference data are retained and sorted in ascending order of absolute value to obtain the corresponding rank values $\beta_i$, and the positive rank sum $W_+$, the negative rank sum $W_-$, the positive average rank $U_+$ and the negative average rank $U_-$ are calculated;

The specific calculation formula is as follows:

$$U_+ = W_+ / m, \qquad U_- = W_- / n$$

where $m$ and $n$ represent the number of positive rank values and negative rank values, respectively;

The test statistic $Z$ and the associated probability value Sig. are calculated by SPSS and compared with the set significance level to judge whether the two groups of sample data differ significantly, as follows:

$$W = \min(W_+, W_-)$$

$$Z = \frac{W - n'(n'+1)/4}{\sqrt{n'(n'+1)(2n'+1)/24}}$$

where $n'$ is the number of valid samples remaining after the cases with zero difference are deleted;

If the obtained probability value is less than or equal to the set significance level, the overall distributions from which the two paired samples come are considered to differ significantly; if the obtained probability value is higher than the set significance level, the overall distributions from which the two paired samples come are considered not to differ significantly.
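The test above is carried out in SPSS; for readers without SPSS, the rank computation can be sketched in plain Python. This is a minimal Wilcoxon signed-rank test with the normal approximation, assuming average ranks for tied absolute differences and no tie or continuity correction, so its Z and probability value may differ slightly from what SPSS reports. The function name and the dictionary keys are hypothetical.

```python
import math


def wilcoxon_signed_rank(first, second):
    """Paired two-sample signed-rank test (normal approximation)."""
    # Differences second - first; pairs with zero difference are dropped.
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n_valid = len(diffs)
    if n_valid == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences in ascending order, averaging ties.
    order = sorted(range(n_valid), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n_valid
    pos = 0
    while pos < n_valid:
        end = pos
        while (end + 1 < n_valid
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = average_rank
        pos = end + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    m = sum(1 for d in diffs if d > 0)   # number of positive ranks
    n = n_valid - m                      # number of negative ranks

    w = min(w_plus, w_minus)
    mean_w = n_valid * (n_valid + 1) / 4
    sd_w = math.sqrt(n_valid * (n_valid + 1) * (2 * n_valid + 1) / 24)
    z = (w - mean_w) / sd_w
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))

    return {
        "W_plus": w_plus, "W_minus": w_minus,
        "U_plus": w_plus / m if m else None,
        "U_minus": w_minus / n if n else None,
        "n_valid": n_valid, "Z": z, "p": p_two_sided,
    }


result = wilcoxon_signed_rank([1, 2, 3, 4, 5], [2, 4, 6, 8, 5])
significant = result["p"] <= 0.05  # 0.05 is an assumed significance level
```

With only four valid pairs the normal approximation is rough; the toy data are meant to show the bookkeeping, not a realistic sample size.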

Preferably, the importance of a word is expressed as follows:

$$P(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{P(V_j)}{|Out(V_j)|}$$

where $P(V_i)$ is the importance of word $i$, $d$ is the damping coefficient, $In(V_i)$ is the set of text segments containing word $i$, $Out(V_j)$ is the set of words in text segment $j$, and $|Out(V_j)|$ is the number of elements in that set.
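The iteration behind this importance score can be sketched as follows. The sketch uses the widely used co-occurrence-graph form of TextRank, in which two words are linked when they appear within a sliding window; this graph construction, the window size, the damping coefficient d = 0.85 and the fixed iteration count are assumptions for illustration and are not taken from the method.

```python
def textrank(tokens, window=2, d=0.85, iterations=50):
    """Score words by iterating P(V_i) = (1 - d) + d * sum(P(V_j) / |Out(V_j)|)
    over an undirected co-occurrence graph.
    `window` is the span size in tokens, including the current word,
    so window=2 links each word to the word that follows it."""
    neighbors = {}
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors.setdefault(word, set()).add(other)
                neighbors.setdefault(other, set()).add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - d) + d * sum(scores[v] / len(neighbors[v])
                                    for v in neighbors[word])
            for word in neighbors
        }
    return scores


# Toy usage: "a" co-occurs with every other word, so it ranks highest.
scores = textrank(["a", "b", "a", "c", "a", "d"])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting by score from high to low gives the ordering used in the subsequent screening step.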

Preferably, the method further comprises:

S4, classifying the overall comment data of the individual building case by building scheme, and analyzing the elements of concern of the online public for the different schemes.

Preferably, in step S4, classifying the overall comment data of the individual building case by building scheme, and analyzing the elements of concern of the online public for the different schemes specifically includes:

S41, classifying the comments on the individual building case on the professional building forum by scheme, and converting each class into txt file format;

S42, based on the vocabulary list formed in step S31, using a Chinese word frequency statistics tool to count, for each of the building scheme comment data sets formed in step S41, the frequency, repetition count, percentage and de-duplicated percentage of each word;

S43, according to the word frequency data formed in step S42, taking the high-frequency word data and performing standard normalization as follows:

Suppose the $i$-th word frequency in the high-frequency word data is $\alpha_i$, and the standard value obtained after standard normalization is $\theta_i$. The specific formula is as follows:

where $i = 1, 2, \ldots, x$;

S44, determining the characteristic words of each building scheme: suppose the standard value of the $i$-th word frequency of the $j$-th scheme is $P_{ij}$; then the word frequency significance value of the standard value is

The specific calculation formula is as follows:

where $i = 1, 2, \ldots, x$; $j = 1, 2$;

S45, taking the words for which

as the characteristic words of the building scheme, thereby obtaining the elements of concern of the online public for the different schemes.

Compared with the prior art, the invention has the following beneficial effects:

1. The method uses the web text of professional building forum comments on large public building designs: it acquires the web text of the professional building forum with LocoySpider software, performs semantic analysis on the web text with a Chinese word segmentation tool and a Chinese word frequency analysis tool, and performs screening, matching and a nonparametric test against the segmented word-class frequency table of the Modern Chinese Corpus to establish the network building professional corpus, which effectively fills the lack of a relevant corpus in the existing field of building commentary.

2. By analyzing the characteristic words of an individual building case, the method can analyze the difference in attention paid to the case by the online public and by professional building designers. This helps the language of building commentary adapt to the new media environment, allows more of that language to be reflected in building design, and promotes the development of building evaluation and building design.

3. The method can classify the overall comment data of an individual building case by building scheme and analyze it to obtain the characteristic words of each scheme, so that professional building designers can learn which elements the online public focuses on in the different schemes and determine the most appropriate building scheme.

Drawings

Fig. 1 is a flowchart of a building evaluation method according to embodiment 1 of the present invention.

Fig. 2 is a diagram comparing the relative proportions of high-frequency words in the network building professional corpus and of the same words in the Modern Chinese Corpus in embodiment 2 of the present invention.

Fig. 3 is a schematic diagram of the building competition schemes for the Zhangjiakou Olympic Sports Center in embodiment 2 of the present invention.

Fig. 4 is a diagram comparing the relative proportions of the characteristic words of the building schemes in embodiment 2 of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

Example 1:

As shown in Fig. 1, the building evaluation method of this embodiment establishes a professional corpus of building commentary in the network environment from the web text of professional building forum comments on large public building designs, and analyzes the difference in attention paid to individual building cases by designers and by the online public. It includes the following steps:

1) selecting a professional building forum, acquiring web text with LocoySpider software, and screening and sorting the web text;

1.1) selecting a professional building forum with a sufficient number of comment samples as the data source;

1.2) creating and editing a new LocoySpider task, analyzing the source code of the web page structure of the professional building forum, and selecting the strings immediately before and after each required field as the identification strings for capturing the required web page information, where the main tag information obtained by crawling comprises the forum topic, the commenter's user name, the comment time, the comment content and the like;

1.3) configuring the content collection rules of the LocoySpider task, and running the task to crawl the relevant data;

1.4) refining and sorting the acquired comment data by the tags forum topic, commenter's user name, comment time and comment content, and removing posts such as forum bulletins and advertisements.

2) performing semantic analysis on the web text with a Chinese word segmentation tool and a Chinese word frequency analysis tool, and performing screening, matching and a nonparametric test against the segmented word-class frequency table of the Modern Chinese Corpus to establish the network building professional corpus;

2.1) converting the screened and sorted professional building forum comment data into txt text format, and segmenting it with the Jieba word segmentation tool to form a vocabulary list of the professional building forum comments;

2.2) based on the vocabulary list formed in step 2.1), using a Chinese word frequency statistics tool to count, for the professional building forum comment data, the frequency, repetition count, percentage and de-duplicated percentage of each word;

2.3) according to the word frequency table of the Modern Chinese Corpus on the Corpus Online website (www.cncorpus.org), matching a certain number of word samples and obtaining their word frequencies both in the professional building forum and in the Modern Chinese Corpus as a whole;

2.4) performing standard normalization on the two groups of word frequency data: suppose the $i$-th word frequency of the $j$-th group's vocabulary list is $\alpha_{ij}$, and the standard value obtained after standard normalization is $\theta_{ij}$. The specific formula is as follows:

where $i = 1, 2, \ldots, x$; $j = 1, 2$;

2.5) importing the normalized data into SPSS software, performing a nonparametric test on the two groups of word frequencies with the two-paired-samples nonparametric test command, and judging whether the overall distributions of the two paired samples differ significantly, specifically:

The difference data are retained and sorted in ascending order of absolute value to obtain the corresponding rank values $\beta_i$, and the positive rank sum $W_+$, the negative rank sum $W_-$, the positive average rank $U_+$ and the negative average rank $U_-$ are calculated;

The specific calculation formula is as follows:

$$U_+ = W_+ / m, \qquad U_- = W_- / n \qquad (3)$$

The test statistic $Z$ and the associated probability value Sig. are calculated by SPSS and compared with the set significance level to judge whether the two groups of sample data differ significantly;

$$W = \min(W_+, W_-) \qquad (4)$$

$$Z = \frac{W - n'(n'+1)/4}{\sqrt{n'(n'+1)(2n'+1)/24}} \qquad (5)$$

where $n'$ is the number of valid samples remaining after the cases with zero difference are deleted;

If the obtained probability value is less than or equal to the set significance level, the overall distributions from which the two paired samples come are considered to differ significantly; if the obtained probability value is higher than the set significance level, the overall distributions from which the two paired samples come are considered not to differ significantly;

2.6) when the overall distributions of the two paired samples differ significantly, analyzing the importance of the professional building forum words based on the TextRank algorithm, with the following formula:

$$P(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{P(V_j)}{|Out(V_j)|} \qquad (6)$$

where $P(V_i)$ is the importance (PR value) of word $i$, $d$ is the damping coefficient, $In(V_i)$ is the set of text segments containing word $i$, $Out(V_j)$ is the set of words in text segment $j$, and $|Out(V_j)|$ is the number of elements in that set. The words are ranked from high to low importance, and the higher a word ranks, the more important it is in the comments;

2.7) sorting the professional building forum words from high to low importance according to the word importance data formed in step 2.6), screening out the high-frequency words of the Modern Chinese Corpus according to the word frequency table of the Modern Chinese Corpus on the Corpus Online website, and taking the remaining words as the network building professional vocabulary;

2.8) classifying and sorting the network building professional vocabulary formed in step 2.7) by building type, building function, building shape, traffic layout, building environment, building color, building materials and structure, spatial layout, building results, building components, building roles and the like, and establishing the network building professional corpus;

3) analyzing the characteristic words of the individual building case, comparing them with the network building professional corpus, and analyzing the difference in attention paid to the individual building case by the online public and by professional building designers;

3.1) converting the screened and sorted comment data of the individual building case into txt text format, and segmenting it with a Chinese word segmentation tool to form a vocabulary list of the comments on the individual building case;

3.2) based on the vocabulary list formed in step 3.1), using a Chinese word frequency statistics tool to count, for the comment data of the individual building case, the frequency, repetition count, percentage and de-duplicated percentage of each word;

3.3) according to the word frequency table of the Modern Chinese Corpus on the Corpus Online website, matching a certain number of word samples and obtaining their word frequencies both in the comments on the individual building case and in the Modern Chinese Corpus as a whole;

3.4) performing standard normalization on the two groups of word frequency data, using formula (1) above;

3.5) importing the normalized data into SPSS software, performing a nonparametric test on the two groups of word frequencies with the two-paired-samples nonparametric test command, and judging whether the overall distributions of the two paired samples differ significantly, using formulas (2) to (5) above;

3.6) when the overall distributions of the two paired samples differ significantly, analyzing the importance of the building scheme words based on the TextRank algorithm, using formula (6) above;

3.7) sorting the words of the individual building case from high to low importance according to the word importance data formed in step 3.6), screening out the high-frequency words of the Modern Chinese Corpus according to the word frequency table of the Modern Chinese Corpus on the Corpus Online website, and taking the remaining words as the characteristic words of the individual building case;

3.8) comparing the characteristic words of the individual building case formed in step 3.7) with the network building professional corpus, and analyzing the difference in attention paid to the individual building case by the online public (ordinary citizens) and by professional building designers;

4) classifying the overall comment data of the individual building case by building scheme, and analyzing the elements of concern of the online public for the different schemes;

4.1) classifying the comments on the individual building case on the professional building forum by scheme, and converting each class into txt file format;

4.2) based on the vocabulary list formed in step 3.1), using a Chinese word frequency statistics tool to count, for each of the building scheme comment data sets formed in step 4.1), the frequency, repetition count, percentage and de-duplicated percentage of each word;

4.3) according to the word frequency data formed in step 4.2), taking the high-frequency word data and performing standard normalization as follows:

where $i = 1, 2, \ldots, x$;

4.4) determining the characteristic words of each building scheme: suppose the standard value of the $i$-th word frequency of the $j$-th scheme is $P_{ij}$; then the word frequency significance value of the standard value is

The specific calculation formula is as follows:

where $i = 1, 2, \ldots, x$; $j = 1, 2$;

4.5) taking the words for which

as the characteristic words of the building scheme, thereby obtaining the elements of concern of the online public for the different schemes, so that professional building designers can learn which elements the online public focuses on in the different schemes and determine the most appropriate building scheme.

Example 2:

This embodiment is an application example. The comment contents of the ABBS building forum and of the Zhangjiakou Daily WeChat subscription account's Zhangjiakou gymnasium design scheme voting platform are selected as the research cases: 4401 posts and 32801 building comments from the building scheme board and the building exchange board of the ABBS building forum, and 4662 comments from the Zhangjiakou gymnasium design scheme voting platform. The specific implementation steps of the whole process include:

1) selecting the ABBS building forum and the Zhangjiakou Daily WeChat subscription account's Zhangjiakou gymnasium design scheme voting platform, acquiring the web text by using LocoySpider software, and screening and sorting it.

1.1) analyzing the source code of the ABBS web page structure;

1.2) selecting the character strings immediately before and after each required field as identification strings for capturing the required web page information, wherein the main fields captured include forum topic, comment user name, comment time, comment content and the like;

1.3) configuring the content collection rules of a LocoySpider task and running the task to crawl the relevant data, wherein, because the ABBS building forum has a multi-level site structure, the required comment data can be obtained by establishing several LocoySpider tasks;

1.4) analyzing the source code of the web page structure of the Zhangjiakou Daily WeChat subscription account's Zhangjiakou gymnasium design scheme voting platform;

1.5) selecting the character strings immediately before and after the required field as identification strings for capturing the required web page information, wherein the main field crawled is the comment content;

1.6) configuring the content collection rules of a LocoySpider task and running the task to crawl the relevant data;

1.7) completing and sorting the obtained comment data by fields such as forum topic, comment user name, comment time and comment content, and removing irrelevant posts such as forum announcements and advertisements.
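The capture rule of steps 1.2) and 1.5), together with the filtering of step 1.7), can be sketched in a few lines. This is only an illustration of the "string before / string after" idea and not the LocoySpider configuration itself; the page fragment, the marker strings and the noise keywords are all hypothetical.

```python
def extract_between(page, before, after):
    """Return every substring of `page` found between `before` and `after`."""
    results, pos = [], 0
    while True:
        start = page.find(before, pos)
        if start == -1:
            break
        start += len(before)
        end = page.find(after, start)
        if end == -1:
            break
        results.append(page[start:end].strip())
        pos = end + len(after)
    return results

# Hypothetical page fragment; the real forum markup is not given in the text.
page = (
    '<div class="user">alice</div><div class="post">Nice facade.</div>'
    '<div class="user">admin</div><div class="post">Forum announcement: rules</div>'
)
users = extract_between(page, '<div class="user">', '</div>')
posts = extract_between(page, '<div class="post">', '</div>')

# Step 1.7): drop announcements and advertisements (keyword list is an assumption).
noise_keywords = ("announcement", "advertisement")
comments = [
    (user, post)
    for user, post in zip(users, posts)
    if not any(keyword in post.lower() for keyword in noise_keywords)
]
```

In practice one such pair of marker strings would be defined per captured field (topic, user name, time, content), mirroring the field list of step 1.2).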

2) Semantic analysis of the web text is carried out with the jieba Chinese word segmentation tool and a Chinese word frequency analysis tool, and the result is screened and matched against, and tested nonparametrically with, the word frequency table of the word-segmented Modern Chinese corpus, so that the network building professional corpus is established.

2.1) converting the screened and sorted forum comment data into txt text format, and performing word segmentation with the "jieba" Chinese word segmentation tool to form a vocabulary list of the ABBS forum comments;

2.2) according to the vocabulary list formed in step 2.1), using a Chinese word frequency statistics tool to count the frequency, the repetition number, the percentage and the de-duplicated percentage of each word in the forum comment data formed in step 2.1);

2.3) according to the word frequency table of the Modern Chinese corpus on the Corpus Online website (www.cncorpus.org), matching the words ranked in the top 50 by word frequency in step 2.2) to obtain vocabulary samples together with their word frequencies in the building forum and in the overall Modern Chinese corpus; a portion of the Modern Chinese corpus is shown in Table 1 below.

TABLE 1 Modern Chinese corpus

2.4) carrying out standard normalization processing on the two groups of word frequency data, wherein the formula refers to the formula (1) in the embodiment 1;

2.5) importing the standardized data into SPSS software, and performing nonparametric test analysis on the two groups of word frequencies by using the two-paired-samples nonparametric test command;

2.5.1) following the sign-test method, subtracting the observed values β_ij of the first group of samples from the corresponding observed values of the second group of samples. If the difference is positive, it is recorded with a plus sign; if the difference is negative, it is recorded with a minus sign; if the difference equals 0, the case is deleted and the number of samples x is reduced accordingly.

2.5.2) retaining the difference data, sorting them in ascending order of absolute value, finding the corresponding rank values β_i, and calculating the positive rank sum W₊, the negative rank sum W₋, the positive average rank U₊ and the negative average rank U₋; see formulas (2) and (3) of Example 1;

2.5.3) calculating the test statistic Z and the accompanying probability value Sig with SPSS, and comparing them with the set significance level to judge whether the two groups of sample data differ significantly; see formulas (4) and (5) of Example 1;

the absolute value of the Z value of the test statistic obtained through calculation is 114.477, and the accompanying probability value Sig is 0.000, which shows that the building professional corpus is significantly different from the modern Chinese integral corpus, and the building network professional forum has characteristic words to be further analyzed.
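Steps 2.5.1) to 2.5.3) correspond to a paired-sample signed-rank test, which can be sketched without SPSS as below. Formulas (2) to (5) are not reproduced in this part of the text, so the sketch uses the textbook normal approximation without tie or continuity correction as an assumption; its Z value may therefore differ slightly from SPSS output. The function name is hypothetical.

```python
import math

def wilcoxon_signed_rank(first, second):
    """Paired-sample signed-rank test; returns (W_plus, W_minus, Z)."""
    # Step 2.5.1): differences (second minus first); zero differences are dropped.
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Step 2.5.2): rank by absolute difference, averaging ranks over ties.
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    k = 0
    while k < n:
        t = k
        while t + 1 < n and abs(diffs[order[t + 1]]) == abs(diffs[order[k]]):
            t += 1
        average_rank = (k + t) / 2 + 1  # mean of 1-based positions k+1 .. t+1
        for index in order[k:t + 1]:
            ranks[index] = average_rank
        k = t + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Step 2.5.3): normal approximation of W = min(W_plus, W_minus).
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w_plus, w_minus, (w - mean) / sd
```

A large absolute Z (and a correspondingly small Sig) indicates that the two paired groups of normalized word frequencies are unlikely to come from the same distribution.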

2.6) analyzing the importance of the words of the architectural forum based on a TextRank algorithm, wherein the formula (6) in the embodiment 1 is shown, the words are ranked from high to low according to the importance of the words, and the more top words have higher importance in comments;

2.7) according to the vocabulary importance data formed in the step 2.6), sequencing the vocabulary of the building forum from high to low, and according to a word frequency table of a modern Chinese language database in an online website of the language database, screening and removing high-frequency vocabularies of the modern Chinese language database, wherein the rest vocabularies are used as network building professional vocabularies;

2.8) classifying and sorting the network building professional vocabulary formed in step 2.7) by category, such as building type, building function, building shape, traffic layout, building environment, building color, building materials and structure, spatial layout, building results, building components, building roles and the like, and establishing the network building professional corpus shown in Table 2 below; the high-frequency words of the network building professional corpus are compared with the vocabulary of the Modern Chinese corpus in Fig. 2.

TABLE 2 Network building professional corpus
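The ranking and screening of steps 2.6) and 2.7) can be sketched as follows. Formula (6) is not reproduced in this part of the text, so the sketch assumes the standard unweighted TextRank update, score(w) = (1 - d) + d · Σ score(v) / degree(v) over the neighbours v of w; the window size, the damping factor d = 0.85, the sample words and the set of general high-frequency words are all assumptions.

```python
def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Score words with unweighted TextRank over a co-occurrence graph."""
    neighbors = {word: set() for word in tokens}
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it within the window.
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[v] / len(neighbors[v]) for v in neighbors[word]
            )
            for word in neighbors
        }
    return scores

# Hypothetical segmented forum text and general-corpus high-frequency words.
tokens = ["facade", "is", "bold", "the", "facade", "uses",
          "glass", "the", "glass", "facade"]
general_high_frequency = {"is", "the"}

scores = textrank(tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
# Step 2.7): what remains after removing general high-frequency words.
professional_words = [w for w in ranked if w not in general_high_frequency]
```

Words that survive the screening are the candidates that step 2.8) would then sort into categories by hand.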

3) By analyzing the characteristic vocabulary of the construction individual case, the attention points of the network masses to the construction individual case and the attention difference with a designer are analyzed.

3.1) converting the screened and sorted building individual case comment data into txt text format, and performing word segmentation with the jieba word segmentation tool to form a vocabulary list of the building individual case comments;

3.3) according to the word frequency table of the Modern Chinese corpus on the Corpus Online website, matching the words ranked in the top 50 by word frequency in step 3.2) to obtain vocabulary samples together with their word frequencies in the building individual case comments and in the overall Modern Chinese corpus;

3.4) carrying out standard normalization processing on the two groups of word frequency data, wherein the formula refers to the formula (1) in the embodiment 1;

3.5) importing the standardized data into SPSS software, and performing nonparametric test analysis on the two groups of word frequencies by using the two-paired-samples nonparametric test command, the formulas being formulas (2) to (5) of Example 1; the calculated absolute value of the test statistic Z is 7.513 and the accompanying probability value Sig is 0.000, so that a significant difference exists between the building individual case comment vocabulary of the Zhangjiakou gymnasium and the Modern Chinese corpus;

3.6) analyzing the importance of the vocabulary of the construction case based on a TextRank algorithm, wherein the formula is shown in formula (6) of example 1;

3.7) according to the vocabulary importance data formed in step 3.6), ranking the building individual case vocabulary by importance from high to low, and screening out the high-frequency words of the Modern Chinese corpus according to the word frequency table of the Modern Chinese corpus on the Corpus Online website, wherein the remaining words are used as the building individual case characteristic vocabulary;

3.8) comparing the building individual case characteristic vocabulary formed in the step 3.7) with a building professional corpus, and analyzing the attention difference between the network masses and professional building designers;
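The text does not specify how the comparison of step 3.8) is carried out. One plausible reading, sketched below as an assumption, is a lookup of each case feature word in the categorized professional corpus: words found there mark attention shared with designers, and words not found mark concerns specific to the network public. The corpus entries and feature words are hypothetical.

```python
def compare_with_corpus(case_features, professional_corpus):
    """Split case feature words by whether the professional corpus contains them."""
    shared = {}       # category -> words also present in the professional corpus
    public_only = []  # words absent from the professional corpus
    for word in case_features:
        category = professional_corpus.get(word)
        if category is None:
            public_only.append(word)
        else:
            shared.setdefault(category, []).append(word)
    return shared, public_only

# Hypothetical professional corpus (word -> category) and case feature words.
professional_corpus = {
    "shape": "building shape",
    "function": "building function",
    "facade": "building components",
}
case_features = ["atmosphere", "shape", "function", "cost"]

shared, public_only = compare_with_corpus(case_features, professional_corpus)
```

Grouping the shared words by category shows which of the Table 2 categories the public discussion touches, while the remaining words point to topics outside the professional vocabulary.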

4) classifying the overall comment data of the individual architecture case according to different architecture schemes, and analyzing the attention elements of the network masses for the different schemes, wherein each architecture scheme of the embodiment is shown in fig. 3;

4.3) according to the word frequency data formed in step 4.2), taking the high-frequency word data and performing standard normalization processing; see formula (7) of Example 1;

4.4) judging the characteristic vocabulary of each building scheme: assuming that the standard value of the i-th word frequency of the j-th scheme is P_ij, its word frequency significance value is

See formula (8) of example 1;

4.5) taking

The vocabulary of (1) is used as the characteristic vocabulary of the building plan, as shown in the following table 3;

Building scheme	Number of comments	Characteristic vocabulary of building scheme
Scheme two	965	atmosphere, building, characteristic, space, function, shape
Scheme three	132	comprehensive, idea, simple, cost
Scheme five	3222	building, comprehensive, practical, beautiful, elegant

TABLE 3 Characteristic vocabulary of each Zhangjiakou gymnasium building scheme

The characteristic vocabularies of the building schemes (schemes two, three and five) are compared in Fig. 4. Table 3 and Fig. 4 show the elements of each scheme that the network public pays attention to, and professional building designers can determine the most suitable building scheme accordingly.

In conclusion, the method makes use of the web text of professional building forum comments on the design of large public buildings: it acquires the web text of the professional building forum with LocoySpider software, performs semantic analysis on it with a Chinese word segmentation tool and a Chinese word frequency analysis tool, screens and matches it against, and tests it nonparametrically with, the word frequency table of the word-segmented Modern Chinese corpus, and establishes a network building professional corpus, which effectively supplements the lack of related corpora in the existing field of building reviews.

The above description covers only the preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art, within the scope disclosed by the present invention and according to its technical solution and inventive concept, falls within the protection scope of the present invention.

Claims

1. A building evaluation method based on web text semantic analysis, characterized by comprising the following steps:

S3, analyzing the characteristic vocabulary of the building individual case, comparing the characteristic vocabulary of the building individual case with the network building professional corpus, and analyzing the attention difference between the network public and professional building designers on the building individual case;

S4, classifying the overall comment data of the building individual case according to different building schemes, and analyzing the attention elements of the network public for the different schemes;

in step S3, analyzing the characteristic vocabulary of the building individual case, comparing it with the network building professional corpus, and analyzing the attention difference between the network public and professional building designers on the building individual case specifically includes:

2. The building evaluation method based on web text semantic analysis according to claim 1, characterized in that: in step S1, selecting a professional building forum, acquiring web text by using LocoySpider software, and screening and sorting it specifically includes:

3. The building evaluation method based on web text semantic analysis according to claim 1, characterized in that: in step S2, performing semantic analysis on the web text with the jieba word segmentation tool and a Chinese word frequency analysis tool, screening and matching the result against, and testing it nonparametrically with, the word frequency table of the word-segmented Modern Chinese corpus, and establishing the network building professional corpus specifically includes:

4. The building evaluation method based on web text semantic analysis according to claim 1 or 3, characterized in that: the standard normalization processing is performed on the two groups of word frequency data, and specifically comprises the following steps:

in the formula: i is 1, 2, …, n; j is 1, 2.

5. The building evaluation method based on web text semantic analysis according to claim 1 or 3, characterized in that: performing nonparametric test analysis on the two groups of word frequencies by using the two-paired-samples nonparametric test command, and judging whether the overall distributions of the two paired samples differ significantly, is specifically:

the difference data are retained and sorted in ascending order of absolute value to find the corresponding rank values β_i, and the positive rank sum W₊, the negative rank sum W₋, the positive average rank U₊ and the negative average rank U₋ are calculated respectively;

The specific calculation formula is as follows:

or

U₊ = W₊/M or U₋ = W₋/n

W = min(W₊, W₋)

6. The building evaluation method based on web text semantic analysis according to claim 1 or 3, characterized in that: the importance of the vocabulary, the formula is as follows:

7. The building evaluation method based on web text semantic analysis according to claim 1, characterized in that: in step S4, the classifying the overall comment data of the individual building case according to different building schemes, and analyzing the attention elements of the network public for different schemes specifically includes:

s43, according to the word frequency data formed in the step S42, the high frequency word data are taken to be subjected to standard normalization processing, and the standard normalization processing comprises the following steps:

assuming that the i-th word frequency in the high-frequency vocabulary data is α_i, the standard value θ_i is obtained after standard normalization processing; the specific formula is as follows:

wherein i is 1,2 …, n;

The specific calculation formula is as follows:

wherein i is 1,2 …, n; j is 1, 2;

s45, getting