CN111178051A

CN111178051A - Building information model self-adaptive Chinese word segmentation method and device

Info

Publication number: CN111178051A
Application number: CN201911404637.6A
Authority: CN
Inventors: 周小平; 张鑫; 王佳
Original assignee: Bim Winner Shanghai Technology Co ltd; Jiaxing Wuzhen Yingjia Qianzhen Technology Co ltd; Shenzhen Bim Winner Technology Co ltd; Shenzhen Qianhai Yingjia Data Service Co ltd; Bim Winner Beijing Technology Co ltd
Current assignee: Bim Winner Beijing Technology Co ltd
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2020-05-19
Anticipated expiration: 2039-12-30
Also published as: CN111178051B

Abstract

The embodiments of the invention provide a building information model (BIM) adaptive Chinese word segmentation method and device. The method comprises: constructing a BIM model dictionary from a target BIM model; embedding a pre-constructed building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model; and segmenting a building-domain sentence with the Chinese word segmentation model in which both dictionaries are embedded. When a BIM model is to be mined, the embodiments optimize the Chinese word segmentation model with the feature information actually used in that BIM model, so that the segmentation result better suits data mining on the model. This achieves BIM-adaptive Chinese word segmentation, improves the efficiency and accuracy of Chinese word segmentation in the building information field, and improves the generality and widens the applicable scope of existing methods such as BIM model retrieval and entity matching.

Description

Building information model self-adaptive Chinese word segmentation method and device

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a Chinese word segmentation method and device for building information model self-adaption.

Background

A Building Information Model (BIM) is a digital information model that records the physical and functional characteristics of a building facility. A BIM contains detailed information on every stage of the building's whole life cycle, which makes building life-cycle data interoperable and promotes effective cooperation among the participants in a construction project. BIM has become an effective solution and an important trend for informatization in the Architecture, Engineering and Construction (AEC) industry, and has been widely researched and applied in construction enterprises.

Generally, a BIM model records the complete data of an engineering project, and most current BIM studies work on one or more BIM models. To make BIM models more efficient to use, some researchers have studied BIM-oriented information retrieval and entity matching. The performance of these studies is directly influenced by the quality of word segmentation. Word segmentation is the division of a text sequence into words, and it is the fundamental first step of many BIM data mining processes. Good word segmentation improves the efficiency of methods such as model-level information retrieval and entity matching, makes them more general, and widens their applicable scope.

For example, when searching for "the double-click double-control switch of the northeast strong electric room on the fifth floor", if the query can be correctly segmented into "fifth floor", "northeast strong electric room", "of" and "double-click double-control switch", the information retrieval system can quickly and accurately locate the corresponding component in the BIM model. However, the accuracy of current mainstream word segmentation methods depends on the training corpus and the application domain, so they are difficult to apply directly to BIM-based, project-level data mining such as information retrieval and entity matching. For this query, current mainstream methods return "fifth floor", "northeast", "strong", "electric room", "double-click", "double-control" and "switch", or "fifth floor", "northeast", "strong electric", "double-click" and "double-control switch". Although some researchers have studied word segmentation methods for the building domain, those methods adapt poorly to a specific BIM model.

Disclosure of Invention

In order to overcome, or at least partially solve, the problems that the results of existing Chinese word segmentation methods cannot be applied directly to data mining on a BIM model and adapt poorly to it, the embodiments of the invention provide a building information model adaptive Chinese word segmentation method and device.

According to a first aspect of the embodiments of the present invention, there is provided a building information model adaptive Chinese word segmentation method, including:

building a BIM model dictionary according to the target BIM model;

embedding a pre-constructed building field term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model;

and segmenting words of the sentence to be segmented in the building field based on the Chinese word segmentation model embedded into the BIM model dictionary and the building field term dictionary.

Specifically, the step of constructing the BIM model dictionary according to the target BIM model includes:

screening out attributes containing characteristic information of the building project from all IFC objects of the target BIM model;

constructing a project characteristic information model according to the attributes; wherein the same attribute in the project characteristic information model represents different building project characteristic information in different IFC objects;

and constructing the BIM model dictionary according to the project characteristic information model.

Specifically, the step of embedding the pre-constructed building domain term dictionary and the BIM model dictionary into the pre-trained Chinese segmentation model comprises the following steps:

and embedding a pre-constructed building field term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model based on a dictionary embedding algorithm of the double-array Trie tree.

Specifically, the step of embedding a pre-constructed building field term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model based on a dictionary embedding algorithm of a double-array Trie tree comprises the following steps:

building a Trie tree of the building domain term dictionary and a Trie tree of the BIM model dictionary;

constructing a double array of the building domain term dictionary according to the Trie tree of the building domain term dictionary, and constructing a double array of the BIM model dictionary according to the Trie tree of the BIM model dictionary;

adding the double array of the building domain term dictionary and the double array of the BIM model dictionary to the trained Chinese segmentation model.

Specifically, the step of segmenting words of the sentence to be segmented in the building field based on the Chinese word segmentation model embedded in the BIM model dictionary and the building field term dictionary comprises the following steps:

segmenting words of the sentence to be segmented in the building field based on the Chinese word segmentation model embedded in the BIM model dictionary and the building field term dictionary to obtain an initial word segmentation result;

optimizing the initial word segmentation result based on the building domain term dictionary embedded in the Chinese word segmentation model to obtain an optimized word segmentation result;

and optimizing the optimized word segmentation result again based on the BIM model dictionary embedded in the Chinese word segmentation model to obtain a twice-optimized word segmentation result.

Specifically, before the step of embedding the pre-constructed building domain term dictionary and the BIM model dictionary into the pre-trained Chinese word segmentation model, the method further comprises:

and training the Chinese word segmentation model based on an HMM method or a CRF method.

screening out term phrases in the building field from a professional knowledge base in the building field;

and constructing the building domain term dictionary according to the term phrases.

According to a second aspect of the embodiments of the present invention, there is provided a building information model adaptive Chinese word segmentation apparatus, including:

the building module is used for building a BIM model dictionary according to the target BIM model;

the embedding module is used for embedding a pre-constructed building field term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model;

and the word segmentation module is used for segmenting words of the sentence to be segmented in the building field based on the Chinese word segmentation model embedded into the BIM model dictionary and the building field term dictionary.

According to a third aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor calls the program instructions to perform the building information model adaptive Chinese word segmentation method provided in any one of the possible implementations of the first aspect.

According to a fourth aspect of the embodiments of the present invention, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the building information model adaptive Chinese word segmentation method provided in any one of the possible implementations of the first aspect.

The embodiments of the invention provide a building information model adaptive Chinese word segmentation method and device. The method constructs an adaptive BIM model dictionary for a specific BIM model, which provides a model-level corpus for Chinese word segmentation, and then embeds the BIM model dictionary and a building domain term dictionary into a Chinese word segmentation model. When a BIM model is to be mined, the Chinese word segmentation model is thus optimized with the feature information actually used in that BIM model, so that the segmentation result better suits data mining on the model. This achieves BIM-adaptive Chinese word segmentation, improves the efficiency and accuracy of Chinese word segmentation in the building information field, and improves the generality and widens the applicable scope of existing methods such as BIM model retrieval and entity matching.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic overall flow chart of a Chinese word segmentation method for building information model adaptation according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an overall architecture of a Chinese word segmentation method for building information model adaptation according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a double-array-based Trie tree in the building information model adaptive Chinese word segmentation method according to the embodiment of the present invention;

FIG. 4 is a schematic diagram of a Trie tree structure of a building domain term dictionary in the building information model adaptive Chinese word segmentation method according to the embodiment of the present invention;

FIG. 5 is an initial segmentation result before embedding a BIM model dictionary and a building domain term dictionary in the building information model adaptive Chinese word segmentation method according to the embodiment of the present invention;

FIG. 6 is a word segmentation result after a BIM model dictionary and a building domain term dictionary are embedded in the building information model adaptive Chinese word segmentation method according to the embodiment of the present invention;

FIG. 7 is a schematic diagram of an overall structure of a building information model adaptive Chinese word segmentation apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic view of an overall structure of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A Building Information Model (BIM) is a complete digital representation of an engineering facility and its characteristics, and aims to integrate and share information across the whole life cycle of a building. As a means of information exchange throughout the building life cycle, BIM is regarded as an effective way to solve problems in the building industry such as information silos and information loss.

The Industry Foundation Classes (IFC) standard is an open, standardized data model designed to achieve interoperability between building information modeling software applications in the AEC and FM (Facility Management) industries, thereby enabling efficient information flow throughout the life cycle of a building. Among the various building data model interchange formats, the IFC standard is the only public, non-proprietary data model employed in the world today. IFC provides a workable extension mechanism and a clear semantic information structure, which lays a solid foundation for acquiring information from a BIM. Since IFC is the current international standard for BIM, the embodiments of the present application assume, without loss of generality, that BIM data is organized in IFC format.

With the continuous development of BIM technology and natural language processing technology, researchers have proposed many application schemes that combine the two. However, applying BIM in Chinese information processing scenarios presupposes that words can be segmented correctly. Improving the model adaptivity of Chinese word segmentation in the building information field therefore provides a workable segmentation basis for these application schemes and improves their usability.

At present, however, the word segmentation step in Chinese scenarios has not been explored in depth. In the building information field, a large amount of professional corpus material is scattered across unstructured drawings and text data, so a large-scale professional corpus is difficult to establish. Meanwhile, new spatial nouns and component attributes often appear in engineering projects, and there is currently no systematic solution for integrating such project characteristics into a Chinese word segmentation model.

The Chinese word segmentation model is a model for segmenting Chinese text, i.e. a sequence of Chinese characters, into words, and is fundamental to Chinese information processing in natural language processing. Machine translation, speech synthesis, search engines, automatic summarization and similar applications all build on a Chinese word segmentation model. Statistical machine learning models outperform traditional rule-based methods on the Chinese word segmentation task. Conventional statistical machine learning models for word segmentation fall broadly into two categories according to the unit they process: character-tagging-based and word-based. Models commonly used for training Chinese word segmentation include Hidden Markov Models (HMMs), Maximum Entropy (ME) models and Conditional Random Fields (CRFs), all of which perform well.

Chinese word segmentation can be regarded as a Chinese character sequence labeling problem. The widely used 4-Tag labeling method is shown in Table 1.

Table 1: 4-Tag character labeling examples

Here, B represents the beginning of a word, E represents the end of a word, M represents the middle part of a word, and S represents a single-character word.
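As a minimal sketch of the 4-Tag scheme described above, the following Python converts a segmented sentence into B/M/E/S tags and back. The function names `to_bmes` and `from_bmes` are hypothetical, and the Chinese sample sentence is a back-translation of the English glosses used in the Background, so it is an assumption rather than text taken from the original.

```python
def to_bmes(words):
    """Convert a list of segmented words into per-character B/M/E/S tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def from_bmes(chars, tags):
    """Rebuild the word list from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


# Assumed back-translation of the Background example sentence.
words = ["五层", "东北强电间", "的", "双击双控开关"]
tags = to_bmes(words)
print("".join(tags))
print(from_bmes("".join(words), tags))
```

Under this view a segmenter only has to predict one tag per character; the word boundaries follow from the tags.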

To facilitate understanding of embodiments of the present invention, the following definitions are given:

Definition 1. Building information model (BIM, $b$): a complete digital representation of the geometric and semantic information of a building, including attribute information of the components, spaces, etc. of a building project. It is denoted by $b$ in this embodiment, with $b = \{f_1, f_2, \ldots, f_n\}$, where $f$ denotes an IFC object.

Definition 2. Dictionary ($D$): in Chinese word segmentation, a word is a character sequence with fixed characteristics, and a dictionary is a set of words, denoted by $D$. For example, "northeast strong electric room" is a word, and {"northeast strong electric room", "double-click double-control switch", …} is a dictionary.

Definition 3. Building domain term dictionary (Dictionary of Building Domain Terms, $D_d$): a dictionary containing the professional terms of building-related disciplines, such as structural, electrical, and water supply and drainage engineering, and of engineering projects. It is denoted by $D_d$.

Definition 4. BIM model dictionary (Dictionary of BIM, $D_b$): a dictionary containing project model feature information, i.e. words that denote a specific component or spatial region in the BIM model. It is denoted by $D_b$.

Definition 5. Project characteristic information model (Project Feature Information Model, $M$): a set of attribute information extracted on the basis of the IFC standard; it is a subset of the IFC object attributes and provides the basis for extracting the BIM model dictionary. It is denoted by $M$, with $M = \{k_1, k_2, \ldots, k_n\}$, where $k$ denotes an attribute in the IFC standard.

In an embodiment of the present invention, a building information model adaptive Chinese word segmentation method is provided. FIG. 1 is a schematic overall flow chart of the method, which includes: S101, constructing a BIM model dictionary according to a target BIM model;

The target BIM model is the BIM model on which data mining is to be performed. The BIM model dictionary is constructed from the building project feature information that is actually used in, and extracted from, the target BIM model. The BIM model dictionary is specific to one BIM model and, in addition to building-domain data, includes feature information that newly appears in the building project.

S102, embedding a pre-constructed building field term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model;

In terms of domain adaptivity, when the application domain of the Chinese word segmentation model differs from the domain of its training corpus, the segmentation accuracy of a Chinese word segmentation model based on statistical machine learning drops sharply. A labeled corpus is a large training data set that has been reviewed and released by an authoritative body. In the building information field, corpus-related data are scattered across every stage of practice and exist largely as unstructured data, so a comprehensive authoritative labeled corpus is difficult to establish.

Therefore, this embodiment optimizes the performance of the Chinese word segmentation model by collecting the terms and concepts of the building domain and building a building domain term dictionary from them. However, the building information field contains not only a large number of terms and concepts common to the building domain, but also highly project-specific spatial relationships and building attribute information in the project model. This embodiment therefore improves the domain adaptivity of the Chinese word segmentation model by establishing the building domain term dictionary, and improves its model adaptivity by establishing the BIM model dictionary. The Chinese word segmentation model thus does not depend on building a huge specialized training corpus when crossing domains, which can greatly widen its applicable scope in other professional fields.

S103, segmenting words of the sentence to be segmented in the building field based on the Chinese word segmentation model embedded into the BIM model dictionary and the building field term dictionary.

To improve both the domain adaptivity and the model adaptivity of the Chinese word segmentation model in the building information field, the BIM adaptive Chinese word segmentation method provided by this embodiment is divided into three parts: training the Chinese word segmentation model, constructing and introducing the building domain term dictionary, and constructing and introducing the BIM model dictionary. The overall architecture is shown in FIG. 2 and includes:

First, training the Chinese word segmentation model: the model is trained on public labeled corpora with a statistical machine learning method; the experimental tool adopted in this embodiment mainly uses two such methods, HMM and CRF;

Second, adding the building domain term dictionary to the Chinese word segmentation model: a domain term dictionary is constructed and embedded into the model to improve its domain adaptivity in the building information field; in this embodiment the building domain term dictionary comes mainly from professional domain knowledge bases and building specifications;

Third, adding the BIM model dictionary to the Chinese word segmentation model: a project characteristic information model is established from the IFC standard, and the BIM model dictionary is extracted from the BIM according to that model and embedded into the Chinese word segmentation model, which improves its model adaptivity in the building information field.

This embodiment constructs an adaptive BIM model dictionary for a specific BIM model, which provides a model-level corpus for Chinese word segmentation, and then embeds the BIM model dictionary and the building domain term dictionary into the Chinese word segmentation model.

On the basis of the foregoing embodiment, the step of constructing the BIM model dictionary according to the target BIM model in this embodiment includes: screening out attributes containing building project characteristic information from all IFC objects of the target BIM model; constructing a project characteristic information model according to the attributes, wherein the same attribute in the project characteristic information model represents different building project characteristic information in different IFC objects; and constructing the BIM model dictionary according to the project characteristic information model.

Specifically, according to the IFC standard, the attributes containing building project characteristic information are screened out of the IFC objects to build the project characteristic information model, in preparation for extracting the BIM model dictionary in the next step. The IFC standard has evolved over decades; as of IFC version 4.0 it contains more than 653 entities, 300 supplemental data types, and an extensible set of attributes. To construct the BIM model dictionary effectively, the set of attributes containing project characteristic information needs to be screened out of the IFC standard.

In the IFC standard, IfcProduct is an abstract representation of any object related to a geometric or spatial context; IfcProperty is an abstract generalization of all types of properties that can be associated with IFC objects through the property set mechanism; IfcProject establishes the context for the information to be exchanged and shared, and can represent a building project; IfcRelationship is an abstract generalization of all objectified relationships in IFC; and IfcSpatialElement is a generalization of all spatial elements that can be used to define a spatial structure or a spatial zone.

Attributes in the project characteristic information model represent different project characteristic information in IFC objects. For example, the Name attribute of an IfcProduct object represents the component name; Name and NominalValue in IfcProperty represent the property name and the property value; and the IfcProject object also has a LongName attribute, which represents the project name, and a Phase attribute, which represents the stage of the construction life cycle that the project is in. The project characteristic information carried by these attributes is what the Chinese word segmentation model needs.

Meanwhile, the same attribute represents different project characteristic information in different IFC objects. For example, Name represents the component name in IfcProduct, the project name in IfcProject, and the relationship between objects in IfcRelationship. LongName represents the project title in IfcProject and the full name of the space in IfcSpatialElement.

The IFC file format has become a widely used form of BIM data storage, and an IFC file can be viewed as a collection of IFC instances, i.e. $b = \{f_1, f_2, \ldots, f_n\}$. The IFC instances contain the project characteristic information that the project characteristic information model requires for constructing the BIM model dictionary. The IFC instances are traversed from the IFC file, and the project characteristic information is extracted from them to construct the BIM model dictionary. The constructed BIM model dictionary $D_b$ and the building domain term dictionary $D_d$ are embedded into the Chinese word segmentation model, which effectively improves the model adaptivity and domain adaptivity of Chinese word segmentation in the building information field.
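The traversal described above can be sketched as follows. This is a simplified illustration, not the patented implementation: IFC instances are modeled as plain Python dicts instead of being parsed from an IFC file, the mapping `FEATURE_MODEL` is a hypothetical stand-in for the project characteristic information model M, IFC subtype inheritance is ignored, and the sample values (including the back-translated Chinese terms) are invented for the example.

```python
# Hypothetical project characteristic information model M: for each IFC
# entity type, the attributes assumed to carry project feature information.
FEATURE_MODEL = {
    "IfcProduct": ["Name"],
    "IfcProject": ["Name", "LongName", "Phase"],
    "IfcSpatialElement": ["Name", "LongName"],
    "IfcProperty": ["Name", "NominalValue"],
}


def build_bim_dictionary(instances, feature_model=FEATURE_MODEL):
    """Collect non-empty string values of the selected attributes into a set."""
    dictionary = set()
    for inst in instances:  # traverse the IFC instances f_1 ... f_n
        for attr in feature_model.get(inst.get("type"), []):
            value = inst.get(attr)
            if isinstance(value, str) and value.strip():
                dictionary.add(value.strip())
    return dictionary


# Invented sample instances standing in for a parsed IFC file.
demo_instances = [
    {"type": "IfcSpatialElement", "Name": "5F-NE", "LongName": "东北强电间"},
    {"type": "IfcProduct", "Name": "双击双控开关"},
    {"type": "IfcProduct", "Name": ""},               # empty names are skipped
    {"type": "IfcRelationship", "Name": "ignored"},   # type not in the model
]
print(sorted(build_bim_dictionary(demo_instances)))
```

A real pipeline would read the instances with an IFC parser and would also need to resolve entity subtypes (for example, a wall is a kind of product); both are left out here to keep the sketch self-contained.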

On the basis of the above embodiment, the step of embedding the pre-constructed building domain term dictionary and the BIM model dictionary in the pre-trained chinese segmentation model in this embodiment includes: and embedding a pre-constructed building field term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model based on a dictionary embedding algorithm of the double-array Trie tree.

Specifically, given a professional knowledge base for the building domain, this embodiment only needs to screen out the building-related term phrases from that knowledge base to form the building domain term dictionary $D_d$. To improve the performance of the statistical Chinese word segmentation model in the professional domain, this embodiment adopts a dictionary embedding algorithm based on a DAT (Double-Array Trie).

The DAT-based dictionary embedding algorithm implements a Trie tree with two arrays, base and check. Let the input character be c, and let state s transition to state t on that input. As shown in FIG. 3, states s and t correspond to array indices. When the input is c, state s transitions to state t, where t = base[s] + c. In the check array, which is parallel to the base array, check[t] = s; that is, the check array records the state from which state t was reached.

On the basis of the above embodiment, in this embodiment, the step of embedding the pre-constructed building domain term dictionary and the BIM model dictionary in the pre-trained Chinese word segmentation model based on the dictionary embedding algorithm of the double-array Trie tree includes: building a Trie tree of the building domain term dictionary and a Trie tree of the BIM model dictionary; constructing a double array of the building domain term dictionary according to the Trie tree of the building domain term dictionary, and constructing a double array of the BIM model dictionary according to the Trie tree of the BIM model dictionary; and adding the double array of the building domain term dictionary and the double array of the BIM model dictionary to the trained Chinese word segmentation model.

Specifically, an example dictionary with the Trie tree structure is shown in FIG. 4, where a box represents a character node and a circle represents the end of a word, i.e. a termination node. The terms that can be recognized in FIG. 4 include "double-click double-control", "double-click double-control switch", "strong electric room", "hyperbolic dome" and "northeast strong electric room". Taking the lookup of "northeast strong electric room" as an example, the character "east" is found first, and the character "north" is then found in the subtree of "east". The node "north" is not a termination state, which means that "northeast" is only part of a term, so the lookup continues character by character until a termination state is reached; at that point "northeast strong electric room" is matched and the dictionary matching ends.
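The Trie lookup just described can be illustrated with a small pointer-based Trie in Python. This is a sketch only: the class and function names are hypothetical, and the five Chinese terms are back-translated from the English glosses of the FIG. 4 terms, which is an assumption.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True when a dictionary term ends at this node


def build_trie(terms):
    root = TrieNode()
    for term in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True  # mark the termination node
    return root


def longest_match(root, text, start=0):
    """Return the longest dictionary term that begins at text[start], or None."""
    node, best = root, None
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:      # no further transition: stop searching
            break
        if node.is_end:       # termination state reached: remember the match
            best = text[start:i + 1]
    return best


# Assumed back-translations of the terms listed for FIG. 4.
TERMS = ["双击双控", "双击双控开关", "强电间", "双曲拱顶", "东北强电间"]
trie = build_trie(TERMS)
print(longest_match(trie, "东北强电间的开关"))
print(longest_match(trie, "双击双控开关"))
```

Because the search keeps going after the first termination node, the longer term wins when one dictionary term is a prefix of another; a prefix that is not itself a term, such as the first two characters of the first query, yields no match.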

After the Trie trees of the building domain term dictionary and the BIM model dictionary are constructed, the characters in the Trie tree are coded in level-order traversal. In the glosses below each quoted word stands for a single Chinese character: the root node is 0, "east" is 1, "strong" is 2, "double" is 3, "north" is 4, "electric" is 5, "click" is 6, "curved" is 7, "room" is 8, "arch" is 9, "control" is 10, "top" is 11, "on" is 12 and "off" is 13. The base array and the check array are then assigned according to these codes. Assume that all array entries are initially 0, and take the construction of "strong electric room" as an example. The code of "strong" is 2 and its index is 2; the code of "electric" is 5. The index of "strong electric" is a + 5, where a = base[2], and that slot must be free, i.e. base[a + 5] = check[a + 5] = 0. Since a must be a positive integer, take a = 1; the index of "strong electric" is then 6 and check[6] = 2, and because base[2] + 5 = 6, base[2] = 1. In the same way, the index of "strong electric room" is determined to be 10 with check[10] = 6; since the code of "room" is 8, base[6] = 2. If a word ends at the current character and the corresponding index is i, then base[i] = -base[i]; if base[i] = 0, then base[i] = -i. The double array constructed from FIG. 4 is shown in Table 2.

Table 2: Double-array example

For example, for the spatial term "strong electric room", the lookup procedure based on table 2 is as follows:

Step 1. According to the code 2 of "strong", find the subscript 2 of the state "strong".

Step 2. According to the code 5 of the next character "electric" and the previous state "strong", find the subscript of "strong electric", base[2] + 5 = 6, with check[6] = 2. Meanwhile, base[6] = 2 > 0 indicates that "strong electric" is only part of some word, so the search continues from this state.

Step 3. Similarly, find the state "strong electric room", whose subscript is base[6] + 8 = 10. At this point base[10] < 0, so a complete term has been matched and the search ends.
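The construction and lookup rules above (base[s] + code(c) = t, check[t] = s, negative base at a termination node) can be sketched as follows. This is a simplified illustration under stated assumptions: the Chinese strings are assumed back-translations of the terms in fig. 4, free slots are marked with -1 instead of 0 so that they cannot be confused with the root state, and each state takes the smallest free offset in breadth-first order. Entries placed late in the process may therefore differ from table 2. The function names are hypothetical.

```python
from collections import deque


def build_double_array(words):
    """Return (base, check, code) for the given words."""
    # Plain Trie first; "$" marks a termination node.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True

    # Character codes in level-order (breadth-first) order of first appearance.
    code, queue = {}, deque([trie])
    while queue:
        node = queue.popleft()
        for ch, child in node.items():
            if ch != "$":
                code.setdefault(ch, len(code) + 1)
                queue.append(child)

    base, check = [0], [-1]  # state 0 is the root; check == -1 marks a free slot

    def ensure(n):
        while len(base) <= n:
            base.append(0)
            check.append(-1)

    # Place states breadth-first so that base[s] + code[c] = t and check[t] = s.
    queue = deque([(0, trie)])
    while queue:
        s, node = queue.popleft()
        kids = [(ch, child) for ch, child in node.items() if ch != "$"]
        if kids:
            b = 0 if s == 0 else 1
            while True:
                slots = [b + code[ch] for ch, _ in kids]
                ensure(max(slots))
                if all(check[t] == -1 for t in slots):
                    break
                b += 1
            base[s] = b
            for (ch, child), t in zip(kids, slots):
                check[t] = s
                queue.append((t, child))
        if "$" in node:  # termination node: negative base, or -s for a leaf
            base[s] = -base[s] if base[s] != 0 else -s
    return base, check, code


def contains(word, base, check, code):
    """Follow the transitions of `word`; True only if it ends on a termination node."""
    s = 0
    for ch in word:
        if ch not in code:
            return False
        t = abs(base[s]) + code[ch]
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return base[s] < 0


words = ["东北强电间", "强电间", "双击双控", "双击双控开关", "双曲拱顶"]
base, check, code = build_double_array(words)
print(contains("强电间", base, check, code), contains("强电", base, check, code))
```

With this word order the character codes and the first placements (base[2] = 1, check[6] = 2) agree with the worked example in the text; a production implementation would normally also compact the arrays and handle deletions, which the sketch omits.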

On the basis of the above embodiment, in this embodiment, the step of segmenting the building domain sentence to be segmented based on the Chinese word segmentation model embedded with the BIM model dictionary and the building domain term dictionary includes: segmenting the building domain sentence to be segmented based on that Chinese word segmentation model to obtain an initial word segmentation result; optimizing the initial word segmentation result based on the building domain term dictionary embedded in the model to obtain an optimized word segmentation result; and optimizing the optimized word segmentation result again based on the BIM model dictionary embedded in the model to obtain the final word segmentation result.

Specifically, the initial word segmentation result before the BIM model dictionary and the building domain term dictionary are embedded is shown in fig. 5, in which impossible label sequences such as "BB" have been removed. Before the two dictionaries are added, the spatial term "northeast strong electric room" has several possible segmentations: it may be labeled as "northeast", "strong electric room", or as "northeast", "strong", "electric room". The building domain term "double-click double-control switch" may be labeled as "double-click", "double-control", "switch" or as "double-click", "double-control switch". The word segmentation result after the BIM model dictionary and the building domain term dictionary are embedded is shown in fig. 6. The model term "northeast strong electric room" and the building domain term "double-click double-control switch" are both segmented correctly, the redundancy of candidate segmentation sequences is greatly reduced, and the word segmentation accuracy is effectively improved.
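The two-stage optimization can be illustrated with a small sketch. The text does not specify the optimization rule, so the sketch assumes one simple possibility: adjacent tokens of the current result are merged whenever their concatenation is a dictionary term, trying the longest run first. The function name, the sample tokens and the dictionary contents are hypothetical.

```python
def merge_with_dictionary(tokens, dictionary):
    """Greedily merge runs of adjacent tokens whose concatenation is a dictionary term."""
    merged, i = [], 0
    while i < len(tokens):
        end = i + 1  # default: keep the token as it is
        # Try the longest run of at least two tokens first.
        for j in range(len(tokens), i + 1, -1):
            if "".join(tokens[i:j]) in dictionary:
                end = j
                break
        merged.append("".join(tokens[i:end]))
        i = end
    return merged


# Hypothetical initial result of a general-purpose segmenter.
initial = ["东北", "强", "电间", "的", "双击", "双控", "开关"]
domain_terms = {"双击双控开关"}  # building domain term dictionary (assumed content)
bim_terms = {"东北强电间"}       # BIM model dictionary (assumed content)

stage1 = merge_with_dictionary(initial, domain_terms)  # first optimization
stage2 = merge_with_dictionary(stage1, bim_terms)      # second optimization
print(stage2)
```

The sketch only merges tokens; it cannot repair an initial result whose token boundaries cut across a dictionary term, which is one reason to embed the dictionaries in the model itself as the embodiment does.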

In this embodiment, the BIM model dictionary and the building domain term dictionary are embedded into the Chinese word segmentation model, which overcomes the shortcomings of the Chinese word segmentation model in cross-domain use and improves the adaptability of Chinese word segmentation in professional fields.

On the basis of the above embodiment, in this embodiment, before the step of embedding the pre-constructed building domain term dictionary and the BIM model dictionary into the pre-trained chinese segmentation model, the method further includes: and training the Chinese word segmentation model based on an HMM method or a CRF method.

Specifically, among statistical training methods for Chinese word segmentation models, two are widely used: the HMM (hidden Markov model) and the CRF (conditional random field). The HMM is a statistical model of a joint probability distribution based on a Markov process, and the CRF is a statistical model of a conditional probability distribution based on a Markov random field. Compared with the HMM, the CRF takes the contextual information between the data content and the data labels into account during modeling, and it achieves better results in many natural language processing tasks. This embodiment takes the CRF as an example to describe the training method of the Chinese word segmentation model.

The CRF learns, from the annotated corpus $X = (X_1, X_2, \dots, X_n)$, a probabilistic mapping to the word-state sequence $Y = (Y_1, Y_2, \dots, Y_n)$. The experimental tool of this embodiment uses a linear-chain conditional random field, whose parameterized form is shown in formula (1).

Here $x$ is a value of the annotated corpus $X$, $y$ is a value of the word-state sequence $Y$, $t_k$ and $s_l$ are feature functions, and $\lambda_k$ and $\mu_l$ are the corresponding weights. $Z(x)$ is the normalization factor, obtained by summing over all possible state sequences, as shown in formula (2).
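For reference, the standard parameterized form of a linear-chain conditional random field, written with the symbols defined above, is given below. It is assumed here that formulas (1) and (2) take this textbook form, with $t_k$ a transition feature defined on two adjacent labels and $s_l$ a state feature defined on a single label.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l\, s_l(y_i, x, i) \right) \tag{1}

Z(x) = \sum_{y} \exp\left( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l\, s_l(y_i, x, i) \right) \tag{2}
```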

For simplicity, the transition features, the state features and their weights are first represented by unified symbols. With $K_1$ transition features, $K_2$ state features and $K = K_1 + K_2$, they are recorded as

The transition and state features are then summed over every position $i$ and recorded as

Let $w_k$ denote the weight of the feature $f_k(y, x)$, i.e.

The conditional random field can then be simplified into formula (3) and formula (4), which are expressed as
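The unified notation can be written out as follows. This is the standard simplified form of a linear-chain conditional random field and is assumed to correspond to the definitions and to formulas (3) and (4) referred to above; the sequence length $n$ is taken from the definition of $X$ and $Y$.

```latex
f_k(y_{i-1}, y_i, x, i) =
\begin{cases}
t_k(y_{i-1}, y_i, x, i), & k = 1, 2, \dots, K_1 \\
s_l(y_i, x, i), & k = K_1 + l,\ l = 1, 2, \dots, K_2
\end{cases}

f_k(y, x) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x, i), \quad k = 1, 2, \dots, K

w_k =
\begin{cases}
\lambda_k, & k = 1, 2, \dots, K_1 \\
\mu_l, & k = K_1 + l,\ l = 1, 2, \dots, K_2
\end{cases}

P(y \mid x) = \frac{1}{Z(x)} \exp \sum_{k=1}^{K} w_k f_k(y, x) \tag{3}

Z(x) = \sum_{y} \exp \sum_{k=1}^{K} w_k f_k(y, x) \tag{4}
```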

After the statistical model $P(y \mid x)$ is established with the CRF, the Chinese word segmentation task is to find the $y^*$ that maximizes $P(y \mid x)$. Since $Z(x)$ is independent of $y$, $y^*$ is given by formula (5).
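Because the exponential function is monotonically increasing and $Z(x)$ does not depend on $y$, maximizing $P(y \mid x)$ is equivalent to maximizing the weighted feature sum. Under the assumption that formula (5) has the standard form, it reads:

```latex
y^* = \arg\max_{y} P(y \mid x) = \arg\max_{y} \sum_{k=1}^{K} w_k f_k(y, x) \tag{5}
```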

The optimal word segmentation result $y^*$ can then be obtained with an algorithm such as Viterbi.
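A minimal Viterbi sketch is given below. It assumes the common four-tag scheme B/M/E/S (begin, middle, end, single) for the word states and takes the already weighted feature scores as plain dictionaries: `emit[i][t]` stands for the summed state-feature score of tag `t` at position `i`, and `trans[(p, t)]` for the transition-feature score from tag `p` to tag `t`. The scores in the usage example are invented for illustration, and the names are hypothetical.

```python
TAGS = "BMES"
NEG = -1e9  # score used for impossible tags or transitions such as B -> B


def viterbi(emit, trans):
    """Return the tag sequence that maximises the summed state and transition scores."""
    score = [{t: emit[0].get(t, NEG) for t in TAGS}]
    back = []
    for i in range(1, len(emit)):
        row, ptr = {}, {}
        for t in TAGS:
            # Best previous tag for reaching tag t at position i.
            best_p = max(TAGS, key=lambda p: score[-1][p] + trans.get((p, t), NEG))
            row[t] = score[-1][best_p] + trans.get((best_p, t), NEG) + emit[i].get(t, NEG)
            ptr[t] = best_p
        score.append(row)
        back.append(ptr)
    # Trace the best path backwards from the best final tag.
    path = [max(TAGS, key=lambda t: score[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Only legal transitions receive a score; all others fall back to NEG.
trans = {pair: 0.0 for pair in [("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
                                ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")]}
# Invented state scores for the three characters of "强电间".
emit = [{"B": 2.0, "S": 1.0},
        {"B": 0.0, "M": 2.0, "E": 1.0, "S": 0.0},
        {"E": 2.0, "S": 1.0}]
tags = viterbi(emit, trans)
print(tags, tags_to_words("强电间", tags))
```

Leaving illegal transitions out of `trans` is what rules out label sequences such as "BB" during decoding.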

This embodiment was verified experimentally as follows.

(1) Evaluation index

The evaluation criteria for word segmentation performance are mainly accuracy (i.e., precision), recall and the F value, denoted P, R and F1 respectively. The accuracy is the proportion of the words output by the word segmentation model that are correct; the recall is the proportion of the words in the correct (reference) result that the model segments correctly; and the F value is a comprehensive index that reflects the overall performance of the word segmentation model. They are calculated as follows.
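The three indexes can be computed as in the following sketch, which assumes the usual word-level definitions: P is the number of correctly segmented words divided by the number of words output by the model, R is the number of correctly segmented words divided by the number of words in the reference result, and F1 = 2PR / (P + R). A word counts as correct when both of its boundaries match the reference. The function names and the sample sentences are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f1(gold_sentences, pred_sentences):
    """Word-level P, R and F1 over parallel lists of segmented sentences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    p = correct / n_pred if n_pred else 0.0
    r = correct / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = [["东北强电间", "的", "开关"]]
pred = [["东北", "强电间", "的", "开关"]]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Comparing character spans rather than word strings avoids counting a word as correct when the same string appears at a different position in the sentence.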

(2) Experimental Environment

The hardware environment is an Intel Core i7 2.8 GHz CPU with 16 GB of memory. The operating system is macOS 10.14.6. All experiments in this embodiment were implemented in Python.

(3) Data test set

In research on Chinese word segmentation in the building information field, there is no publicly available evaluation corpus. Therefore, this embodiment crawls 1300 articles from the Chinese building construction technology management network as the building domain evaluation corpus. The articles contain a large number of terms and concepts from the field of building engineering, and 3200 sentences are selected from them to form the building domain corpus test set.

The point location table of building equipment records the specific installation information of the building equipment in an engineering project, including the equipment name, equipment number and installation position. It effectively provides the spatial information and equipment information of the project and can therefore be used as the test data set for word segmentation at the project level, as shown in table 3.

TABLE 3 construction equipment point location example table

No.	Loop	Address	Device type	Location
1	048	001	Smoke detector	Second-floor public reading area of library
2	048	002	Audible and visual alarm	Second-floor southeast staircase of library
3	048	003	Signal butterfly valve	Outside the second-floor east toilet of library
4	048	004	Fire display panel	Second-floor northeast elevator front room of library

(4) Comparison method

The experiments verify the accuracy of project-adaptive Chinese word segmentation for a CRF and for an HMM, each combined with a custom dictionary. Specifically, the HanLP platform (pyHanlp-0.1.48) is used to implement the method combining CRF with incremental training, and the Jieba platform (Jieba-0.39) is used to implement the method combining HMM with a custom dictionary. Both tools provide word segmentation trained on large corpora and support extension with custom features. The accuracy and overall performance of both HanLP and Jieba exceed 80% on the public PKU and MSR test sets.

(5) Results of the experiment

First, the poor adaptability of the original Chinese word segmentation models to the building field is verified. Specifically, a total of 13376 professional phrases from building and related fields are collected manually from professional knowledge bases and building specifications to form the building domain term dictionary. The two word segmentation tools HanLP and Jieba are then used to segment these phrases, and the number and proportion of correctly segmented phrases are counted; the experimental results are shown in table 4. The accuracy of HanLP is 34.87% and that of Jieba is 38.19%. Neither platform exceeds 40% on the collected professional phrases, which indicates that existing Chinese word segmentation platforms adapt poorly to the building information field.

TABLE 4 Comparison of HanLP and Jieba word segmentation on building domain terms

The building domain corpus test set is then used for a word segmentation test, and the accuracy, recall and F value of each platform on this test set are all low. Specifically, HanLP achieves an accuracy of 68.03%, a recall of 82.49% and an F value of 74.57%; Jieba achieves an accuracy of 69.74%, a recall of 83.29% and an F value of 76.23%. These results further show that existing Chinese word segmentation platforms adapt poorly to the building information field.

Next, the influence of the domain term dictionary on the HanLP and Jieba word segmentation platforms is verified. After the domain term dictionary is added, the accuracy of HanLP on the building domain corpus test set rises from 68.03% to 70.73%, the recall from 82.49% to 83.83% and the F value from 74.57% to 76.72%, i.e., the accuracy improves by 2.70 percentage points and the F value by 2.15 percentage points. The accuracy of Jieba rises from 69.74% to 72.15%, the recall from 83.29% to 84.48% and the F value from 76.23% to 77.83%, i.e., the accuracy improves by 2.41 percentage points and the F value by 1.60 percentage points. These results show that the performance of Chinese word segmentation on the building domain corpus test set is clearly improved, and that adding the building domain term dictionary effectively improves the adaptability of Chinese word segmentation at the level of the building information field.

Then, with word segmentation performance improved at the level of the building information field, the equipment point location table is used to verify the word segmentation effect at the model level. The equipment point location table of a library on a university campus is taken as the test data. Before and after the domain term dictionary is added, the accuracy of LTP rises only from 14.29% to 16.72%, that of HanLP only from 11.5% to 14.98%, and that of Jieba only from 13.59% to 15.68%; the largest improvement is only 3.48 percentage points. Clearly, a word segmentation model loaded only with the domain dictionary improves the segmentation accuracy for the model information in the building equipment point location table very little, and its model adaptability is poor. In the building information field, therefore, optimizing Chinese word segmentation at the domain level alone is not enough, and model-specific information needs to be supplemented.

Finally, the BIM model dictionary is added; table 5 compares the accuracy of the word segmentation platforms at the different stages. After the BIM model dictionary is added, on the equipment point location table test set the accuracy of HanLP rises from 14.98% to 92.68% and that of Jieba from 15.68% to 92.68%, a marked improvement.

TABLE 5 Accuracy comparison of the word segmentation platforms at different dictionary-loading stages

By adding the building domain term dictionary and the BIM model dictionary, the Chinese word segmentation accuracy on the equipment point location table information is increased several times over: the HanLP accuracy rises to 8.06 times its initial value and the Jieba accuracy to 6.81 times its initial value.

The experimental results show that adding the BIM model dictionary and the building domain term dictionary to Chinese word segmentation effectively improves the word segmentation accuracy at both the building information model level and the domain level, and thereby improves the model adaptability of Chinese word segmentation in the building information field.

In another embodiment of the present invention, an adaptive Chinese word segmentation apparatus for building information model is provided, which is used to implement the methods in the foregoing embodiments. Therefore, the description and definition in the foregoing embodiments of the building information model adaptive chinese word segmentation method may be used for understanding each execution module in the embodiments of the present invention. Fig. 7 is a schematic diagram of an overall structure of a building information model adaptive chinese word segmentation apparatus provided in an embodiment of the present invention, where the apparatus includes a construction module 701, an embedding module 702, and a word segmentation module 703, where:

The construction module 701 is used for building a BIM model dictionary according to the target BIM model;

The target BIM model is the BIM model on which data mining needs to be performed. The construction module 701 extracts the feature information of the actually used building project from the target BIM model to construct the BIM model dictionary. The BIM model dictionary is a dictionary for a specific BIM model; in addition to data related to the building field, it includes feature information that newly appears in the building project.

The embedding module 702 is used for embedding a pre-constructed building domain term dictionary and the BIM model dictionary in a pre-trained Chinese word segmentation model;

The performance of the Chinese word segmentation model is optimized by collecting terms and concepts of the building field and establishing a building domain term dictionary from them. However, the building information field contains not only a large number of terms and concepts common to the building field but also highly project-specific spatial relationships and building attribute information in the project model. Therefore, this embodiment improves the domain adaptability of the Chinese word segmentation model in the building information field by establishing the building domain term dictionary, and also improves its model adaptability by establishing the BIM model dictionary. For cross-domain use, the Chinese word segmentation model does not need to rely on building a huge professional training corpus, so its application range in other professional fields can be greatly expanded.

The word segmentation module 703 is configured to segment words of the building domain sentence to be segmented based on the chinese word segmentation model embedded in the BIM model dictionary and the building domain term dictionary.

On the basis of the above embodiment, the construction module in this embodiment is specifically configured to: screen out, from all IFC objects of the target BIM model, the attributes that contain feature information of the building project; construct a project feature information model according to these attributes, wherein the attributes in the project feature information model that contain the feature information of the building project differ between IFC objects; and construct the BIM model dictionary according to the project feature information model.
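The behaviour of the construction module can be sketched as follows. The sketch does not parse a real IFC file: each IFC object is represented by a plain dictionary, and the mapping from object type to feature-carrying attributes (standing in for the project feature information model) is an assumption chosen for illustration. The function name and the sample values are hypothetical.

```python
def build_bim_dictionary(ifc_objects, feature_attributes):
    """Collect the distinct feature strings of a model into a sorted term list.

    ifc_objects: iterable of dicts, each with a "type" key plus attribute values.
    feature_attributes: dict mapping an IFC type name to the attributes that
        carry project feature information for that type.
    """
    terms = set()
    for obj in ifc_objects:
        for attr in feature_attributes.get(obj.get("type"), ()):
            value = obj.get(attr)
            if isinstance(value, str) and value.strip():
                terms.add(value.strip())
    return sorted(terms)


# Assumed project feature information model: different IFC object types keep
# their feature information in different attributes.
feature_attributes = {
    "IfcSpace": ["LongName"],
    "IfcBuildingStorey": ["Name"],
}
# Hypothetical objects extracted from a target BIM model.
objects = [
    {"type": "IfcSpace", "Name": "A-201", "LongName": "东北强电间"},
    {"type": "IfcBuildingStorey", "Name": "二层"},
    {"type": "IfcWall", "Name": "W1"},
]
bim_dictionary = build_bim_dictionary(objects, feature_attributes)
print(bim_dictionary)
```

Keeping the type-to-attribute mapping as data makes it easy to extend the dictionary to further IFC object types without changing the extraction code.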

On the basis of the foregoing embodiment, the embedding module in this embodiment is specifically configured to: embed the pre-constructed building domain term dictionary and the BIM model dictionary into the pre-trained Chinese word segmentation model based on a double-array Trie tree dictionary embedding algorithm.

On the basis of the foregoing embodiment, the embedding module in this embodiment is further configured to: build a Trie tree of the building domain term dictionary and a Trie tree of the BIM model dictionary; construct a double array of the building domain term dictionary according to the Trie tree of the building domain term dictionary, and construct a double array of the BIM model dictionary according to the Trie tree of the BIM model dictionary; and add the double array of the building domain term dictionary and the double array of the BIM model dictionary to the trained Chinese word segmentation model.

On the basis of the foregoing embodiment, the word segmentation module in this embodiment is specifically configured to: segment the building domain sentence to be segmented based on the Chinese word segmentation model embedded with the BIM model dictionary and the building domain term dictionary to obtain an initial word segmentation result; optimize the initial word segmentation result based on the building domain term dictionary embedded in the model to obtain an optimized word segmentation result; and optimize the optimized word segmentation result again based on the BIM model dictionary embedded in the model to obtain the final word segmentation result.

On the basis of the above embodiments, the embodiment further includes a training module, configured to train the chinese word segmentation model based on an HMM method or a CRF method.

On the basis of the foregoing embodiments, the construction module in this embodiment is further configured to: screen out term phrases of the building field from a professional knowledge base of the building field; and construct the building domain term dictionary according to the term phrases.

Fig. 8 illustrates the physical structure of an electronic device. As shown in fig. 8, the electronic device may include a processor 801, a communications interface 802, a memory 803 and a communication bus 804, wherein the processor 801, the communications interface 802 and the memory 803 communicate with each other through the communication bus 804. The processor 801 may call logic instructions in the memory 803 to perform the following method: building a BIM model dictionary according to the target BIM model; embedding a pre-constructed building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model; and segmenting the building domain sentence to be segmented based on the Chinese word segmentation model embedded with the BIM model dictionary and the building domain term dictionary.

In addition, the logic instructions in the memory 803 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example, including: building a BIM model dictionary according to the target BIM model; embedding a pre-constructed building field term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model; and segmenting words of the sentence to be segmented in the building field based on the Chinese word segmentation model embedded into the BIM model dictionary and the building field term dictionary.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A Chinese word segmentation method for building information model self-adaptation is characterized by comprising the following steps:

building a BIM model dictionary according to the target BIM model;

2. The building information model adaptive Chinese word segmentation method of claim 1, wherein the step of constructing a BIM model dictionary according to the target BIM model comprises:

3. The building information model adaptive Chinese word segmentation method according to claim 1, wherein the step of embedding a pre-constructed building domain term dictionary and the BIM model dictionary in a pre-trained Chinese word segmentation model comprises:

4. The building information model adaptive Chinese word segmentation method of claim 3, wherein the step of embedding a pre-constructed building domain term dictionary and the BIM model dictionary in a pre-trained Chinese word segmentation model based on a dictionary embedding algorithm of a double-array Trie tree comprises:

5. The building information model adaptive Chinese word segmentation method according to claim 1, wherein the step of segmenting words of the sentence to be segmented in the building field based on the Chinese word segmentation model embedded in the BIM model dictionary and the building field term dictionary comprises:

6. The building information model adaptive Chinese word segmentation method according to any one of claims 1 to 5, wherein the step of embedding a pre-constructed building domain term dictionary and the BIM model dictionary in a pre-trained Chinese word segmentation model further comprises:

7. The building information model adaptive Chinese word segmentation method according to any one of claims 1 to 5, wherein the step of embedding a pre-constructed building domain term dictionary and the BIM model dictionary in a pre-trained Chinese word segmentation model further comprises:

8. A building information model self-adaptive Chinese word segmentation device is characterized by comprising the following components:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the building information model adaptive chinese word segmentation method according to any one of claims 1 to 7.

10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of the building information model adaptive chinese word segmentation method according to any one of claims 1 to 7.