CN111178051B

CN111178051B - Self-adaptive Chinese word segmentation method and device for building information model

Info

Publication number: CN111178051B
Application number: CN201911404637.6A
Authority: CN
Inventors: 周小平; 张鑫; 王佳
Original assignee: Bim Winner Beijing Technology Co ltd
Current assignee: Bim Winner Beijing Technology Co ltd
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2024-05-17
Anticipated expiration: 2039-12-30
Also published as: CN111178051A

Abstract

An embodiment of the invention provides a self-adaptive Chinese word segmentation method and device for a building information model. The method comprises: constructing a BIM model dictionary from a target BIM model; embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model; and segmenting a building-domain sentence to be segmented using the Chinese word segmentation model into which the BIM model dictionary and the building domain term dictionary have been embedded. When data mining is to be performed on a BIM model, the feature information actually used in that model is used to optimize the Chinese word segmentation model in a targeted manner, so that the segmentation result is better suited to data mining of that model. This realizes BIM-model-adaptive Chinese word segmentation, improves the efficiency and accuracy of Chinese word segmentation in the building information field, and broadens the generality and application range of existing methods such as BIM model retrieval and entity matching.

Description

Self-adaptive Chinese word segmentation method and device for building information model

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a self-adaptive Chinese word segmentation method and device for a building information model.

Background

The building information model (Building Information Modeling, BIM) is a digital information model that records the physical and functional characteristics of a building facility. A BIM model contains detailed information for every stage of the building's whole life cycle, which makes life-cycle data interoperable and promotes effective cooperation among all participants in a construction project. BIM has become an effective solution and an important trend for the informatization of Architecture, Engineering and Construction (AEC) projects, and has been widely studied and applied in construction enterprises.

Typically, a BIM model records the complete data of an engineering project. Current BIM research is mostly carried out on one or more BIM models. To improve the application efficiency of BIM models, some scholars have studied BIM-oriented information retrieval, entity matching and similar tasks. The performance of these methods is directly affected by the quality of word segmentation. Word segmentation, the splitting of a text sequence into words, is the basic first step of many BIM data mining processes. Good word segmentation improves the efficiency of methods such as model-level information retrieval and entity matching, increases their generality, and enlarges their application range.

For example, when searching for "fifth floor northeast strong electric room double-click double-control switch", if the query is correctly segmented into "fifth floor", "northeast strong electric room" and "double-click double-control switch", the information retrieval system can quickly and accurately locate the corresponding component in the BIM model. However, the accuracy of current mainstream word segmentation methods depends on the training corpus and the application field, so they are difficult to apply directly to BIM-based, project-level data mining research such as information retrieval and entity matching. For this query, a current mainstream method returns "fifth floor", "northeast", "strong", "electric room", "double-click", "double-control", "switch", or "fifth floor", "northeast", "strong electric room", "double-click", "double-control switch". Although some scholars have developed word segmentation methods for the building field, these methods adapt poorly to specific BIM models.

Disclosure of Invention

To solve, or at least partially solve, the problem that the segmentation results of existing Chinese word segmentation methods cannot be applied directly to data mining of a BIM model and adapt poorly to it, embodiments of the invention provide a self-adaptive Chinese word segmentation method and device for a building information model.

According to a first aspect of an embodiment of the present invention, there is provided a self-adaptive Chinese word segmentation method for a building information model, including:

Constructing a BIM model dictionary according to the target BIM model;

Embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model;

And segmenting a sentence to be segmented in the building field based on the Chinese word segmentation model into which the BIM model dictionary and the building domain term dictionary have been embedded.

Specifically, the step of constructing a BIM model dictionary from the target BIM model includes:

Screening out, from each IFC object of the target BIM model, the attributes containing building project feature information;

Constructing a project feature information model from the attributes; wherein the same attribute in the project feature information model represents different building project feature information in different IFC objects;

And constructing the BIM model dictionary according to the project feature information model.

Specifically, the step of embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model comprises:

Embedding the pre-built building domain term dictionary and the BIM model dictionary into the pre-trained Chinese word segmentation model based on a double-array Trie dictionary embedding algorithm.

Specifically, the step of embedding the pre-built building domain term dictionary and the BIM model dictionary into the pre-trained Chinese word segmentation model based on the double-array Trie dictionary embedding algorithm comprises:

Constructing a Trie tree of the building domain term dictionary and a Trie tree of the BIM model dictionary;

Constructing a double array of the building domain term dictionary from its Trie tree, and a double array of the BIM model dictionary from its Trie tree;

And adding the double array of the building domain term dictionary and the double array of the BIM model dictionary to the trained Chinese word segmentation model.

Specifically, the step of segmenting a sentence to be segmented in the building field based on the Chinese word segmentation model into which the BIM model dictionary and the building domain term dictionary have been embedded comprises:

Segmenting the sentence to be segmented in the building field with the Chinese word segmentation model as it was before the BIM model dictionary and the building domain term dictionary were embedded, to obtain an initial word segmentation result;

Optimizing the initial word segmentation result based on the building domain term dictionary in the Chinese word segmentation model after embedding, to obtain an optimized word segmentation result;

And re-optimizing the optimized word segmentation result based on the BIM model dictionary in the Chinese word segmentation model after embedding, to obtain a re-optimized word segmentation result.

Specifically, before the step of embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model, the method further comprises:

Training the Chinese word segmentation model based on an HMM method or a CRF method;

Screening out building domain terms and phrases from a building domain professional knowledge base;

And constructing the building domain term dictionary from the terms and phrases.

According to a second aspect of the embodiment of the present invention, there is provided a self-adaptive Chinese word segmentation device for a building information model, including:

the construction module is used for constructing a BIM model dictionary according to the target BIM model;

The embedding module is used for embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model;

And the word segmentation module is used for segmenting a sentence to be segmented in the building field based on the Chinese word segmentation model into which the BIM model dictionary and the building domain term dictionary have been embedded.

According to a third aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor invokes the program instructions to perform the self-adaptive Chinese word segmentation method for a building information model provided by any one of the possible implementations of the first aspect.

According to a fourth aspect of embodiments of the present invention, there is further provided a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the self-adaptive Chinese word segmentation method for a building information model provided by any one of the possible implementations of the first aspect.

Embodiments of the invention provide a self-adaptive Chinese word segmentation method and device for a building information model. A self-adaptive BIM model dictionary is constructed for a specific BIM model, providing a model-level corpus for Chinese word segmentation, and the BIM model dictionary and the building domain term dictionary are then embedded into the Chinese word segmentation model. When data mining is to be performed on the BIM model, the feature information actually used in that model is used to optimize the Chinese word segmentation model in a targeted manner, so that the segmentation result is better suited to data mining of that model. This realizes BIM-model-adaptive Chinese word segmentation, improves the efficiency and accuracy of Chinese word segmentation in the building information field, and helps broaden the generality and application range of existing methods such as BIM model retrieval and entity matching.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic overall flow chart of the self-adaptive Chinese word segmentation method for a building information model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the overall architecture of the self-adaptive Chinese word segmentation method for a building information model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a double-array-based Trie tree in the self-adaptive Chinese word segmentation method for a building information model according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the Trie tree structure of the building domain term dictionary in the self-adaptive Chinese word segmentation method for a building information model according to an embodiment of the present invention;

FIG. 5 shows the initial word segmentation result before the BIM model dictionary and the building domain term dictionary are embedded, in the self-adaptive Chinese word segmentation method for a building information model according to an embodiment of the present invention;

FIG. 6 shows the word segmentation result after the BIM model dictionary and the building domain term dictionary are embedded, in the self-adaptive Chinese word segmentation method for a building information model according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the overall structure of the self-adaptive Chinese word segmentation device for a building information model according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of the overall structure of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The building information model (Building Information Modeling, BIM) is a complete digital representation of engineering facility entities and their characteristics, aimed at achieving information integration and sharing throughout the life cycle of a building. As an information interaction manner throughout the life cycle of a building, BIM is considered as an effective means for solving the problems of information islands, information loss, and the like in the building industry.

The Industry Foundation Classes (IFC) standard is an open, standardized data model intended to enable interoperability between the information modeling software applications used in the AEC and FM (Facility Management) industries, so that information flows efficiently throughout the life cycle of a building. Among the various building data model exchange formats, the IFC standard is currently the only public, non-proprietary data model in use worldwide. IFC provides a workable extension mechanism and a clear semantic information structure, which lays a solid foundation for acquiring information from a BIM model. Because IFC is the current international standard for BIM, embodiments of the application assume, without loss of generality, that BIM data is organized in IFC format.

Currently, with the continuous development of BIM technology and natural language processing technology, researchers have proposed many application schemes of combining BIM technology and natural language processing technology, but the premise of applying BIM in chinese information processing scenes is that words can be correctly segmented. Therefore, the model adaptability of Chinese word segmentation in the field of building information is improved, a feasible word segmentation foundation is provided for the application schemes, and the usability of the application schemes is improved.

However, no deep exploration and research is currently being conducted on word segmentation links in chinese scenes. In the field of building information, a large number of specialized corpora are scattered in unstructured drawings and text data, and a large specialized corpus is difficult to build. Meanwhile, new spatial nouns and component attributes often appear in engineering projects, and how to integrate the characteristics in the projects into a Chinese word segmentation model is not known at present.

A Chinese word segmentation model splits Chinese text, that is, a sequence of Chinese characters, into words, and is basic to Chinese information processing in the field of natural language processing. Machine translation, speech synthesis, search engines, automatic summarization and similar applications all build on Chinese word segmentation. On the Chinese word segmentation task, statistical machine learning models outperform traditional rule-based methods. According to the structural unit they process, conventional statistical machine learning models for word segmentation fall into two main classes: character-based tagging and word-based learning. Common models for training a Chinese word segmentation model include the hidden Markov model (Hidden Markov Model, HMM), the maximum entropy model (Maximum Entropy, ME) and the conditional random field (Conditional Random Fields, CRF), all of which perform well on this task.

The word segmentation of Chinese can be regarded as a problem of labeling Chinese character sequences, and the most widely applied 4-Tag labeling method is shown in Table 1.

Table 1 4-Tag examples

Here B marks the first character of a word, M a middle character, E the last character, and S a single-character word.
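As a minimal illustration of the 4-Tag scheme (a sketch, not the patent's implementation), the Python functions below convert a segmented sentence into B/M/E/S tags and back. The function names `words_to_tags` and `tags_to_words` are hypothetical, and each Latin letter in the example stands in for one Chinese character.

```python
def words_to_tags(words):
    """Return one B/M/E/S tag per character for a list of words."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                      # single-character word
        else:
            tags.append("B")                      # first character
            tags.extend("M" * (len(word) - 2))    # middle characters, if any
            tags.append("E")                      # last character
    return tags


def tags_to_words(chars, tags):
    """Rebuild the words from a character sequence and its B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):                     # a word closes on E or S
            words.append(current)
            current = ""
    if current:                                   # tolerate a sequence cut off mid-word
        words.append(current)
    return words


# Each letter stands in for one Chinese character (placeholder data).
tags = words_to_tags(["AB", "C", "DEF"])
print(tags)                           # ['B', 'E', 'S', 'B', 'M', 'E']
print(tags_to_words("ABCDEF", tags))  # ['AB', 'C', 'DEF']
```

Viewing segmentation as tagging is what lets sequence models such as HMM or CRF be applied: the model predicts one tag per character, and the words are read off the tag sequence.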

To facilitate an understanding of embodiments of the present invention, the following definitions are given:

Definition 1 Building information model (BIM, b): the building information model is a complete digital representation of building geometry information and semantic information, containing attribute information of building projects, components, spaces and so on. In this embodiment it is denoted by b, with b = {f_1, f_2, …, f_n}, where each f_i denotes one IFC object.

Definition 2 Dictionary (D): in Chinese word segmentation, a word is a character sequence with fixed features, and a dictionary is a collection of words, denoted by D. For example, "northeast strong electric room" is a word, and {"northeast strong electric room", "double-click double-control switch", …} is a dictionary.

Definition 3 Building domain term dictionary (Dictionary of Building Domain Terms, D_d): the building domain term dictionary contains the technical terms of each specialty related to the building field, such as terms from the architectural, electrical, and water supply and drainage specialties and terms used in engineering projects. It is denoted by D_d.

Definition 4 BIM model dictionary (Dictionary of BIM, D_b): the BIM model dictionary is a dictionary containing project model feature information, that is, the words in the BIM model that refer to specific building elements or spatial regions. It is denoted by D_b.

Definition 5 Project feature information model (Project Feature Information Model, M): the project feature information model is a set of attributes extracted on the basis of the IFC standard. It is a subset of the IFC object attributes and provides the basis for extracting the BIM model dictionary. It is denoted by M, with M = {k_1, k_2, …, k_n}, where each k_i denotes an attribute in the IFC standard.

In one embodiment of the present invention, a self-adaptive Chinese word segmentation method for a building information model is provided. FIG. 1 is a schematic overall flow chart of the method, which includes: S101, constructing a BIM model dictionary according to a target BIM model;

The target BIM model is a BIM model needing data mining. And extracting characteristic information of the actually used building project from the target BIM model to construct a BIM model dictionary. The BIM model dictionary is a dictionary oriented to a specific BIM model, and contains relevant data of the building field and newly-appearing characteristic information in the building project.

S102, embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model;

In terms of domain adaptability, when the field in which a Chinese word segmentation model is applied differs from the field of its training corpus, the accuracy of a Chinese word segmentation model based on statistical machine learning drops sharply. An annotated corpus is a large training data set that has been reviewed and released by an authoritative body. In the building information field, corpus-related data is scattered across every stage of practice and exists largely as unstructured data, so a comprehensive authoritative annotated corpus is difficult to build.

Therefore, this embodiment optimizes the performance of the Chinese word segmentation model by collecting building domain terms and concepts and building a building domain term dictionary from them. However, the building information field contains not only a large number of general building domain terms and concepts, but also highly project-specific spatial relationships and component attribute information in each project model. The embodiment therefore improves the domain adaptability of the Chinese word segmentation model in the building information field by establishing a building domain term dictionary, and improves its model adaptability by establishing a BIM model dictionary. Because this cross-domain approach does not depend on building a huge professional training corpus, it can also greatly widen the application range of Chinese word segmentation models in other professional fields.

S103, segmenting a sentence to be segmented in the building field based on the Chinese word segmentation model into which the BIM model dictionary and the building domain term dictionary have been embedded.

To improve both the domain adaptability and the model adaptability of the Chinese word segmentation model in the building information field, the BIM-model-adaptive Chinese word segmentation method of this embodiment is divided into three parts: training the Chinese word segmentation model, constructing and introducing the building domain term dictionary, and constructing and introducing the BIM model dictionary. The overall architecture is shown in FIG. 2 and comprises:

First, training the Chinese word segmentation model: the model is trained on a general annotated corpus with a statistical machine learning method; the experimental tools used in this embodiment mainly adopt two such methods, HMM and CRF;

Second, adding the building domain term dictionary to the Chinese word segmentation model: a domain term dictionary is constructed and embedded into the model to improve its domain adaptability in the building information field; the building domain term dictionary in this embodiment comes mainly from professional domain knowledge bases and building codes;

Third, adding the BIM model dictionary to the Chinese word segmentation model: a project feature information model is established from the IFC standard, the BIM model dictionary is extracted from the BIM model according to the project feature information model, and the BIM model dictionary is embedded into the model to improve its model adaptability in the building information field.

In this embodiment, a self-adaptive BIM model dictionary is built for a specific BIM model to provide a model-level corpus for Chinese word segmentation, and the BIM model dictionary and the building domain term dictionary are then embedded into the Chinese word segmentation model.

On the basis of the above embodiment, the step of constructing a BIM model dictionary according to the target BIM model in this embodiment includes: screening out, from each IFC object of the target BIM model, the attributes containing building project feature information; constructing a project feature information model from the attributes, wherein the same attribute in the project feature information model represents different building project feature information in different IFC objects; and constructing the BIM model dictionary according to the project feature information model.

Specifically, according to the IFC standard, the attributes containing building project feature information are screened out from all IFC objects to construct the project feature information model, in preparation for extracting the BIM model dictionary in the next step. The IFC standard has evolved over decades; as of IFC version 4.0 it contains more than 653 entities and more than 300 supplementary data types, as well as an extensible collection of property sets. To construct the BIM model dictionary efficiently, the attributes containing project feature information must be screened out of the IFC standard.

In the IFC standard, IfcProduct is an abstract representation of any object related to a geometric or spatial context; IfcProperty is an abstract generalization of all types of properties that can be associated with IFC objects through the property set mechanism; IfcProject establishes the context for information to be exchanged or shared and can represent a building project; IfcRelationship is the abstract generalization of all objectified relationships in IFC; and IfcSpatialElement is the generalization of all spatial elements that can be used to define a spatial structure or spatial zone.

The attributes in the project feature information model represent different project feature information in IFC objects. For example, the Name attribute of an IfcProduct object represents the component name; Name and NominalValue represent the property name and property value contained in an IfcProperty; an IfcProject object also has a LongName attribute that represents the formal name of the project and a Phase attribute that represents the stage of the building life cycle that the project is in. The project feature information carried by these attributes is what the Chinese word segmentation model needs.

Meanwhile, the same attribute represents different project feature information in different IFC objects. For example, Name represents the component name in IfcProduct, the project name in IfcProject, and the association between objects in IfcRelationship. LongName represents the formal name of the project in IfcProject and the full name of the space in IfcSpatialElement.

Currently, the IFC file format is a widely used form of BIM data storage, and a BIM model can be regarded as a collection of IFC instances, i.e. b = {f_1, f_2, …, f_n}. The IFC instances contain the project feature information that the project feature information model requires for constructing the BIM model dictionary. The IFC instances are traversed from the IFC file, and project feature information is extracted from them to construct the BIM model dictionary. The resulting BIM model dictionary D_b and the building domain term dictionary D_d are embedded into the Chinese word segmentation model, which effectively improves the model adaptability and domain adaptability of Chinese word segmentation in the building information field.
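The traversal described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: `IfcInstance` is a hypothetical in-memory stand-in for a parsed IFC instance (no IFC parser is used), each instance is assumed to be already resolved to one of the base classes named in the text, and the project, space and component names in the example are invented placeholders. The mapping `FEATURE_MODEL` plays the role of M, with the attribute choices taken from the examples in the text.

```python
from dataclasses import dataclass, field

# Project feature information model M: IFC object type -> attributes that
# carry project feature information (attribute lists follow the text).
FEATURE_MODEL = {
    "IfcProduct": ["Name"],
    "IfcProperty": ["Name", "NominalValue"],
    "IfcProject": ["Name", "LongName", "Phase"],
    "IfcRelationship": ["Name"],
    "IfcSpatialElement": ["LongName"],
}


@dataclass
class IfcInstance:
    """Hypothetical stand-in for one parsed IFC instance f_i."""
    ifc_type: str
    attributes: dict = field(default_factory=dict)


def build_bim_dictionary(instances, feature_model=FEATURE_MODEL):
    """Collect the non-empty text values of the attributes listed in M."""
    dictionary = set()
    for inst in instances:
        for attr in feature_model.get(inst.ifc_type, []):
            value = inst.attributes.get(attr)
            if isinstance(value, str) and value.strip():
                dictionary.add(value.strip())
    return dictionary


# Placeholder model b = {f_1, ..., f_n}; all values are invented for illustration.
model = [
    IfcInstance("IfcProject", {"Name": "P-01", "LongName": "Demo office tower"}),
    IfcInstance("IfcSpatialElement", {"LongName": "northeast strong electric room"}),
    IfcInstance("IfcProduct", {"Name": "double-click double-control switch"}),
    IfcInstance("IfcProduct", {"Name": None}),          # unnamed component, skipped
]
d_b = build_bim_dictionary(model)
print(sorted(d_b))
```

Keeping M as a plain mapping makes the screening step explicit: changing which attributes feed the dictionary means editing the mapping, not the traversal.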

On the basis of the above embodiment, the step of embedding the pre-built building domain term dictionary and the BIM model dictionary in the pre-trained chinese word segmentation model in this embodiment includes: and embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model based on a dictionary embedding algorithm of the double-array Trie tree.

Specifically, thanks to existing building domain expert knowledge bases, the building domain term dictionary D_d can be formed simply by screening the relevant terms and phrases out of such a knowledge base. To improve the segmentation performance of the statistical Chinese word segmentation model in the professional field, this embodiment adopts a dictionary embedding algorithm based on DAT (Double-Array Trie).

The DAT-based dictionary embedding algorithm implements a Trie tree with two arrays, base and check. Let c be the code of the input character that triggers a transition from state s to state t. As shown in FIG. 3, the states s and t correspond to array indices. When the input is c, state s transitions to state t, where t = base[s] + c. In the check array, which runs parallel to the base array, check[t] = s; that is, the check array records the state from which state t was reached.

On the basis of the above embodiment, the step of embedding the pre-built building domain term dictionary and the BIM model dictionary into the pre-trained Chinese word segmentation model based on the double-array Trie dictionary embedding algorithm in this embodiment includes: constructing a Trie tree of the building domain term dictionary and a Trie tree of the BIM model dictionary; constructing a double array of the building domain term dictionary from its Trie tree, and a double array of the BIM model dictionary from its Trie tree; and adding the double array of the building domain term dictionary and the double array of the BIM model dictionary to the trained Chinese word segmentation model.

Specifically, an example of a dictionary in tree structure is shown in FIG. 4, where a box represents a character node and a circle marks the end of a word, i.e. a termination node. The words that can be recognized in FIG. 4 are the terms "double-click double-control", "double-click double-control switch", "strong electric room", "hyperbolic vault" and "northeast strong electric room". Taking the lookup of "northeast strong electric room" as an example, the character "east" is searched first, then the character "north". "North" lies on the subtree of "east" and is not in a termination state, which means that "northeast" is only part of a term. The search continues character by character until a termination state is reached, at which point "northeast strong electric room" has been matched and the dictionary lookup is complete.
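The tree lookup just described can be sketched with an ordinary dictionary-of-children Trie. This is an illustrative sketch only: the class and method names are hypothetical, and each Chinese character is represented by an English gloss ("east", "north", "strong", ...), with the decomposition of the five terms into characters inferred from the character list given for FIG. 4.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_end = False    # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def contains(self, word):
        """True only if the whole word ends on a termination node."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

    def longest_prefix(self, chars, start=0):
        """Length of the longest dictionary word that starts at chars[start]."""
        node, best = self.root, 0
        for i in range(start, len(chars)):
            node = node.children.get(chars[i])
            if node is None:
                break
            if node.is_end:
                best = i - start + 1
        return best


# Glosses standing in for the characters of the five terms in FIG. 4 (assumed).
TERMS = [
    ("double", "hit", "double", "control"),
    ("double", "hit", "double", "control", "open", "close"),
    ("strong", "electric", "room"),
    ("double", "curve", "arch", "top"),
    ("east", "north", "strong", "electric", "room"),
]
trie = Trie(TERMS)
query = ("east", "north", "strong", "electric", "room")
print(trie.contains(query))        # True
print(trie.contains(query[:2]))    # False: "east north" is only part of a term
print(trie.longest_prefix(("double", "hit", "double", "control", "open")))  # 4
```

A pointer-based Trie like this is easy to read but stores one mapping per node; the double-array form discussed next packs the same transitions into two integer arrays.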

After the Trie trees of the building domain term dictionary and the BIM model dictionary are built, the characters in each Trie tree are encoded in level-order traversal sequence: the root node is 0, "east" is 1, "strong" is 2, "double" is 3, "north" is 4, "electric" is 5, "hit" is 6, "curve" is 7, "room" is 8, "arch" is 9, "control" is 10, "top" is 11, "open" is 12 and "close" is 13. The base and check arrays are then assigned according to these codes. Let all array entries initially be 0, and take the construction of "strong electric" as an example. The code of "strong" is 2 and its index is 2; the code of "electric" is 5, so "strong electric" needs an offset a, a positive integer, such that base[a + 5] = check[a + 5] = 0. Choosing a = 1, the index of "strong electric" is 6 and check[6] = 2; since base[2] + 5 = 6 must hold, base[2] = 1. The index of "strong electric room" is 10, so check[10] = 6; the code of "room" is 8, so base[6] = 2. If a word ends at the current character, whose index is i, then base[i] is set to -base[i], or to -i if base[i] = 0. The double array constructed from FIG. 4 is shown in Table 2.

Table 2 double-array example table

For example, for the spatial term "strong electric room", the search steps based on table 2 are as follows:

Step 1, find the index 2 of the state "strong" according to the code 2 of "strong".

Step 2, find the index base[2] + 5 = 6 of "strong electric" according to the code 5 of the next character "electric" combined with the previous character "strong". Since check[base[2] + 5] = check[6] = 2 and base[6] = 2 > 0, "strong electric" is part of a word that starts with "strong", so the search continues.

Step 3, find the state "strong electric room", whose index is base[6] + 8 = 10. Since base[10] < 0, the search is complete.
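The three search steps can be traced with a small Python sketch. It is only an illustration under stated assumptions: the arrays contain just the entries derived in the text for "strong electric room" (table 2 itself is not reproduced here), the array size 16 and the function name `lookup` are hypothetical, and an entry with base equal to 0 is treated as unused, which follows from the rule that a terminal node with base 0 is stored as -i.

```python
# Character codes from the level-order encoding; the keys are English glosses.
CODE = {"east": 1, "strong": 2, "double": 3, "north": 4, "electric": 5,
        "click": 6, "curve": 7, "room": 8, "arch": 9, "control": 10,
        "top": 11, "open": 12, "close": 13}

SIZE = 16                      # assumed size, large enough for this fragment
base = [0] * SIZE
check = [0] * SIZE
base[2] = 1                    # "strong": index 2, children at 1 + code
base[6], check[6] = 2, 2       # "strong electric": parent index 2
base[10], check[10] = -10, 6   # "strong electric room": terminal leaf, base = -index


def lookup(chars):
    """Return 'term', 'prefix' or 'none' by walking the base/check arrays."""
    state = 0  # root
    for pos, ch in enumerate(chars):
        code = CODE.get(ch)
        if code is None:
            return "none"
        # The first character sits at the index equal to its code;
        # later characters sit at |base[state]| + code.
        nxt = code if pos == 0 else abs(base[state]) + code
        if nxt >= SIZE or base[nxt] == 0 or check[nxt] != state:
            return "none"
        state = nxt
    if state == 0:
        return "none"
    return "term" if base[state] < 0 else "prefix"


print(lookup(["strong"]))                      # prefix
print(lookup(["strong", "electric"]))          # prefix
print(lookup(["strong", "electric", "room"]))  # term
```

Because only one branch of the dictionary is filled in, other terms from fig. 4 return `none` in this fragment; a complete double array would populate their entries in the same way.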

On the basis of the above embodiment, the step of performing word segmentation on the sentence to be segmented in the building domain based on the Chinese word segmentation model in which the BIM model dictionary and the building domain term dictionary are embedded includes: performing word segmentation on the sentence to be segmented in the building domain based on the Chinese word segmentation model before the BIM model dictionary and the building domain term dictionary are embedded, to obtain an initial word segmentation result; optimizing the initial word segmentation result based on the building domain term dictionary in the Chinese word segmentation model after the two dictionaries are embedded, to obtain an optimized word segmentation result; and re-optimizing the optimized word segmentation result based on the BIM model dictionary in the Chinese word segmentation model after the two dictionaries are embedded, to obtain a re-optimized word segmentation result.

Specifically, the initial word segmentation result before the BIM model dictionary and the building domain term dictionary are embedded is shown in fig. 5, from which label sequences that cannot occur, such as BB, have been removed. Before the two dictionaries are added, the spatial term "northeast strong electric room" has several possible segmentations: it may be labeled as "northeast", "strong electric room" or as "northeast", "strong", "electric room". The building domain term "double click double control switch" may be labeled as "double click", "double control", "switch" or as "double click", "double control switch". The word segmentation result after the BIM model dictionary and the building domain term dictionary are embedded is shown in fig. 6. The model term "northeast strong electric room" and the building domain term "double click double control switch" are both segmented correctly, which greatly reduces the redundancy of candidate segmentation sequences and effectively improves the word segmentation accuracy.
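The three stages (initial segmentation, optimization with the building domain term dictionary, re-optimization with the BIM model dictionary) can be sketched as below. This is a simplified illustration and not the embodiment's algorithm: the helper names `merge_with_dictionary` and `segment` are hypothetical, the optimization is reduced to greedily merging adjacent tokens that together form a dictionary term, and English glosses joined by spaces stand in for Chinese text, where tokens would be concatenated without a separator.

```python
def merge_with_dictionary(tokens, dictionary, sep=" "):
    """Greedily merge runs of adjacent tokens that together form a dictionary term."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest span first; a span must cover at least two tokens.
        for j in range(len(tokens), i + 1, -1):
            candidate = sep.join(tokens[i:j])
            if candidate in dictionary:
                merged.append(candidate)
                i = j
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def segment(initial_tokens, domain_terms, bim_terms):
    """Stage 2 applies the domain term dictionary, stage 3 the BIM model dictionary."""
    optimized = merge_with_dictionary(initial_tokens, domain_terms)
    return merge_with_dictionary(optimized, bim_terms)


# One of the possible initial segmentations mentioned in the text.
initial = ["northeast", "strong", "electric room",
           "double click", "double control", "switch"]
domain_terms = {"double click double control switch"}
bim_terms = {"northeast strong electric room"}

result = segment(initial, domain_terms, bim_terms)
print(result)
# ['northeast strong electric room', 'double click double control switch']
```

Applying the domain dictionary before the BIM model dictionary mirrors the order of the steps in this embodiment; a production system would perform the dictionary matching inside the segmenter rather than as a separate merge pass.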

By embedding the BIM model dictionary and the building domain term dictionary into the Chinese word segmentation model, this embodiment mitigates the cross-domain shortcomings of the model and improves the adaptability of Chinese word segmentation in professional domains.

On the basis of the above embodiment, before the step of embedding the pre-built building domain term dictionary and the BIM model dictionary into the pre-trained Chinese word segmentation model, this embodiment further includes: training the Chinese word segmentation model based on an HMM method or a CRF method.

Specifically, among the statistical training methods for Chinese word segmentation models, two are widely used: the hidden Markov model (HMM) and the conditional random field (CRF). The HMM is a statistical model of a joint probability distribution based on a Markov process, whereas the CRF is a statistical model of a conditional probability distribution based on a Markov random field. Compared with the HMM, the CRF takes the relationship between the data content and the data labels into account when modeling, and CRF-based models achieve better results in many natural language processing tasks. This embodiment takes the CRF as an example to introduce the training method for Chinese word segmentation.

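The CRF learns a probability function mapping from the labeled corpus X = (X_1, X_2, …, X_n) to the word state sequence Y = (Y_1, Y_2, …, Y_n). The experimental tool of this embodiment uses a linear-chain conditional random field model, whose parameterized form is shown in formula (1).

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i) \right) \tag{1}$$

Here x is a value of the labeled corpus X, y is a value of the word state sequence Y, t_k and s_l are feature functions, and λ_k and μ_l are the corresponding weights. Z(x) is the normalization factor, in which the sum runs over all possible state sequences, as shown in formula (2).

$$Z(x) = \sum_{y} \exp\left( \sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i) \right) \tag{2}$$

For simplicity, the transition features, the state features and their weights are represented by unified symbols. Suppose there are K_1 transition features and K_2 state features, with K = K_1 + K_2, and write

$$f_k(y_{i-1}, y_i, x, i) = \begin{cases} t_k(y_{i-1}, y_i, x, i), & k = 1, 2, \dots, K_1 \\ s_l(y_i, x, i), & k = K_1 + l;\ l = 1, 2, \dots, K_2 \end{cases}$$

The transition and state features are then summed over every position i, written as

$$f_k(y, x) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x, i), \quad k = 1, 2, \dots, K$$

The weight of the feature f_k(y, x) is denoted by w_k, i.e.

$$w_k = \begin{cases} \lambda_k, & k = 1, 2, \dots, K_1 \\ \mu_l, & k = K_1 + l;\ l = 1, 2, \dots, K_2 \end{cases}$$

The conditional random field can thus be reduced to formulas (3) and (4):

$$P(y \mid x) = \frac{1}{Z(x)} \exp \sum_{k=1}^{K} w_k f_k(y, x) \tag{3}$$

$$Z(x) = \sum_{y} \exp \sum_{k=1}^{K} w_k f_k(y, x) \tag{4}$$

After the statistical model P(y|x) is established with the CRF, the Chinese word segmentation task is to find the y^* that maximizes P(y|x). Since Z(x) is independent of y and the exponential function is monotonic, y^* is given by formula (5).

$$y^{*} = \arg\max_{y} P(y \mid x) = \arg\max_{y} \sum_{k=1}^{K} w_k f_k(y, x) \tag{5}$$

The optimal word segmentation result y^* can be obtained with the Viterbi algorithm.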
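As an illustration of how the Viterbi algorithm finds the highest-scoring state sequence, the following Python sketch decodes a toy three-character input. It rests on several assumptions that are not taken from this embodiment: the B/M/E/S (begin, middle, end, single) tag set, the integer scores, and the function names are all hypothetical stand-ins for the weighted transition and state features of a trained CRF.

```python
def viterbi(n, tags, state_score, trans_score):
    """Return the highest-scoring tag sequence of length n.

    state_score(i, tag) and trans_score(prev_tag, tag) stand in for the
    weighted state and transition features evaluated at position i.
    """
    best = [{t: state_score(0, t) for t in tags}]  # best score ending in each tag
    back = []                                      # back-pointers per position
    for i in range(1, n):
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + trans_score(p, t))
            scores[t] = best[-1][prev] + trans_score(prev, t) + state_score(i, t)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["B", "M", "E", "S"]
# Tag transitions assumed to be valid; every other transition is penalized.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
# Toy state scores for three positions (hypothetical values).
STATE = [{"B": 3, "M": 0, "E": 0, "S": 1},
         {"B": 0, "M": 2, "E": 1, "S": 0},
         {"B": 0, "M": 0, "E": 3, "S": 1}]


def state_score(i, tag):
    return STATE[i][tag]


def trans_score(prev, tag):
    return 0 if (prev, tag) in ALLOWED else -100


print(viterbi(3, TAGS, state_score, trans_score))  # ['B', 'M', 'E']
```

The decoded sequence B, M, E marks the three characters as one word, which is how a three-character term such as "strong electric room" would be labeled under this tag set.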

This embodiment is verified by the following experiments.

(1) Evaluation index

Word segmentation performance is mainly evaluated by accuracy (precision), recall and F value, denoted by P, R and F respectively. The accuracy is the proportion of the words output by the word segmentation model that are correct, the recall is the proportion of the words in the correct result that the model segments correctly, and the F value is a combined indicator of the overall performance of the model. The calculation method is as follows.
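A minimal Python sketch of these three indicators is given below, assuming the standard definitions: P is the number of correctly segmented words divided by the number of words output by the model, R is the number of correctly segmented words divided by the number of words in the reference segmentation, and F = 2PR / (P + R). The function names are hypothetical, and a word counts as correct only when both of its boundaries agree with the reference.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f_value(p, r):
    """Harmonic mean of precision and recall (assumed definition of F)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def evaluate(gold_words, predicted_words):
    """Return (P, R, F) for one sentence segmented two ways."""
    gold, pred = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & pred)          # words whose boundaries both match
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return p, r, f_value(p, r)


# Toy example: the model splits the second reference word in two.
p, r, f = evaluate(["ab", "cd", "e"], ["ab", "c", "d", "e"])
print(p, r, f)
```

Under this assumption, `f_value(0.6803, 0.8249)` evaluates to about 0.7457, which agrees with the F value of 74.57% reported for the HanLP baseline in the experimental results.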

(2) Experimental environment

The hardware environment is an Intel Core i 7.8 GHz CPU with 16 GB of memory. The operating system is macOS 10.14.6. All experiments of this embodiment were implemented in Python.

(3) Data test set

No public evaluation corpus is available for Chinese word segmentation research in the field of building information. Therefore, 1300 articles were crawled from the Chinese building construction technology management network to construct an evaluation corpus for the building field. The articles contain a large number of terms and concepts from the field of construction engineering, and 3200 sentences were selected from them to form the building field corpus test set.

The construction equipment point position table records the specific installation information of construction equipment in an engineering project, including equipment names, equipment numbers and equipment installation positions. It effectively provides the space information and equipment information of the project and can therefore be used as a test data set for Chinese word segmentation at the project level, as shown in table 3.

Table 3 construction equipment point location example table

No.	Loop	Address	Device type	Location
1	048	001	Smoke detector	Library second-floor public reading area
2	048	002	Audible and visual alarm	Library second-floor southeast stairwell
3	048	003	Signal butterfly valve	Outside the library second-floor east bathroom
4	048	004	Fire display panel	Library second-floor northeast elevator front room

(4) Control method

This embodiment verifies the accuracy of two methods in project-adaptive Chinese word segmentation: CRF combined with a custom dictionary and HMM combined with a custom dictionary. Specifically, the HanLP platform (pyHanlp-0.1.48) is used to implement CRF combined with incremental training, and the Jieba platform (jieba-0.39) is used to implement HMM combined with a custom dictionary. Both tools provide word segmentation based on a large corpus and support extension with custom features. HanLP and Jieba both exceed 80% in accuracy and overall performance on the two public test sets PKU and MSR.

(5) Experimental results

First, the poor adaptability of the original Chinese word segmentation models in the building field is verified. Specifically, a total of 13376 professional phrases from the building field and related fields were collected manually from professional knowledge bases and building specifications to form the building field term dictionary. These phrases were then segmented with the two word segmentation tools HanLP and Jieba, and the number and accuracy of correctly segmented phrases were counted; the experimental results are shown in table 4. The accuracy of HanLP is 34.87% and the accuracy of Jieba is 38.19%. Neither platform exceeds 40% on the collected professional phrases, which indicates that existing Chinese word segmentation platforms adapt poorly to the field of building information.

Table 4 comparison of HanLP and Jieba word segmentation on building field terms

A word segmentation test was then performed on the building field corpus test set, on which the word segmentation accuracy, recall and F value are all low. Specifically, HanLP has a word segmentation accuracy of 68.03%, a recall of 82.49% and an F value of 74.57%; Jieba has a word segmentation accuracy of 69.74%, a recall of 83.29% and an F value of 76.23%. These results further show that existing Chinese word segmentation platforms adapt poorly to the field of building information.

Next, the impact of the domain term dictionary on the HanLP and Jieba word segmentation platforms is verified. After the domain term dictionary is added, the word segmentation accuracy of HanLP on the building field corpus test set rises from 68.03% to 70.73%, the recall from 82.49% to 83.83% and the F value from 74.57% to 76.72%, an improvement of 2.70 percentage points in accuracy and 2.15 percentage points in F value. The word segmentation accuracy of Jieba rises from 69.74% to 72.15%, the recall from 83.29% to 84.48% and the F value from 76.23% to 77.83%, an improvement of 2.41 percentage points in accuracy and 1.60 percentage points in F value. The experimental results show that the performance of Chinese word segmentation on the building field corpus test set is clearly improved, and that adding the building field term dictionary effectively improves the adaptability of Chinese word segmentation at the level of the building information field.

Then, on the basis of the improved field-level word segmentation performance, the word segmentation effect at the model level is verified with the equipment point position table. The library equipment point position table in the Xingjing area of university is used as test data. After the domain term dictionary is added, the word segmentation accuracy of LTP rises only from 14.29% to 16.72%, that of HanLP only from 11.5% to 14.98%, and that of Jieba only from 13.59% to 15.68%; the largest improvement is only 3.48 percentage points. Clearly, a word segmentation model loaded with the domain dictionary gains very little accuracy on the model information in the building equipment point position table, and its model adaptability is poor. In the field of building information, optimizing word segmentation only at the domain level is therefore insufficient, and model-level features must be supplemented.

Finally, the BIM model dictionary is added; table 5 compares the accuracy of the word segmentation platforms at the different stages. After the BIM model dictionary is added, on the equipment point position table test set, the word segmentation accuracy of HanLP rises from 14.98% to 92.68% and that of Jieba rises from 15.68% to 92.68%, a remarkable improvement.

Table 5 comparison of the accuracy of the word segmentation platforms at different dictionary loading stages

By adding the building field term dictionary and the BIM model dictionary, the accuracy of Chinese word segmentation on the equipment point position table information is improved several-fold: the accuracy of HanLP reaches 8.06 times its original value and the accuracy of Jieba reaches 6.81 times its original value.

The experimental results show that adding the BIM model dictionary and the building field term dictionary to Chinese word segmentation effectively improves the word segmentation accuracy at both the building information model level and the building field level, and that using the two dictionaries improves the model adaptability of Chinese word segmentation in the field of building information.

In another embodiment of the present invention, a building information model adaptive Chinese word segmentation device is provided for implementing the method of the foregoing embodiments. The descriptions and definitions given in the method embodiments above therefore also apply to the execution modules of this embodiment. Fig. 7 is a schematic diagram of the overall structure of the building information model adaptive Chinese word segmentation device according to an embodiment of the present invention. The device includes a building module 701, an embedding module 702 and a word segmentation module 703, where:

The building module 701 is configured to build a BIM model dictionary according to the target BIM model;

The target BIM model is the BIM model on which data mining is to be performed. The building module 701 constructs the BIM model dictionary from the feature information of the building project that is actually used in, and extracted from, the target BIM model. The BIM model dictionary is a dictionary oriented to a specific BIM model; it contains relevant data of the building field as well as feature information newly appearing in the building project.

The embedding module 702 is configured to embed a pre-built architectural domain term dictionary and the BIM model dictionary in a pre-trained chinese word segmentation model;

This embodiment optimizes the performance of the Chinese word segmentation model by collecting the terms and concepts of the building field and building a building field term dictionary from them. However, the building information field contains not only a large number of general building field terms and concepts, but also highly project-specific spatial relationships and construction attribute information in each project model. Therefore, this embodiment improves the field adaptability of the Chinese word segmentation model in the field of building information by establishing the building field term dictionary, and also improves its model adaptability by establishing the BIM model dictionary. The cross-domain Chinese word segmentation model then no longer depends on building a huge professional training corpus, which greatly widens the application range of Chinese word segmentation models in other professional fields.

The word segmentation module 703 is configured to segment the sentence to be segmented in the building domain based on the Chinese word segmentation model embedded in the BIM model dictionary and the building domain term dictionary.

On the basis of the above embodiment, the building module in this embodiment is specifically configured to: screen out, from each IFC object of the target BIM model, the attributes that contain feature information of the building project; construct a project feature information model from these attributes, wherein the attributes in the project feature information model that carry the feature information of the building project differ between IFC objects; and construct the BIM model dictionary according to the project feature information model.
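The screening of attributes per IFC object can be sketched in Python as follows. This is a hedged illustration only: IFC objects are represented as plain dictionaries rather than parsed from an IFC file, and the mapping `FEATURE_ATTRIBUTES`, the choice of entity types and attributes in it, and the function name `build_bim_dictionary` are assumptions made for the example, not the selection used by this embodiment.

```python
# Hypothetical project feature information model: for each IFC entity type,
# the attributes assumed to carry project feature information.
FEATURE_ATTRIBUTES = {
    "IfcSpace": ["LongName"],
    "IfcBuildingStorey": ["Name"],
    "IfcFlowTerminal": ["Name", "ObjectType"],
}


def build_bim_dictionary(ifc_objects, feature_attributes=FEATURE_ATTRIBUTES):
    """Collect distinct, non-empty feature strings from plain-dict IFC objects."""
    terms = set()
    for obj in ifc_objects:
        # Entity types outside the feature model contribute nothing.
        for attr in feature_attributes.get(obj.get("type"), []):
            value = obj.get(attr)
            if isinstance(value, str) and value.strip():
                terms.add(value.strip())
    return sorted(terms)


# Illustrative objects; the values are made up for the example.
objects = [
    {"type": "IfcSpace", "Name": "R-201", "LongName": "northeast strong electric room"},
    {"type": "IfcBuildingStorey", "Name": "library second floor"},
    {"type": "IfcFlowTerminal", "Name": "smoke detector", "ObjectType": None},
    {"type": "IfcWall", "Name": "W-17"},
]
print(build_bim_dictionary(objects))
# ['library second floor', 'northeast strong electric room', 'smoke detector']
```

Keeping the attribute selection in a separate mapping reflects the point that different IFC objects carry their feature information in different attributes.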

On the basis of the above embodiment, the embedding module in this embodiment is specifically configured to: embed the pre-built building domain term dictionary and the BIM model dictionary into the pre-trained Chinese word segmentation model based on a dictionary embedding algorithm using a double-array Trie tree.

On the basis of the above embodiment, the embedding module in this embodiment is further configured to: construct a Trie tree of the building field term dictionary and a Trie tree of the BIM model dictionary; construct a double array of the building field term dictionary according to the Trie tree of the building field term dictionary, and construct a double array of the BIM model dictionary according to the Trie tree of the BIM model dictionary; and add the double array of the building field term dictionary and the double array of the BIM model dictionary to the trained Chinese word segmentation model.

Based on the above embodiment, the word segmentation module in this embodiment is specifically configured to: perform word segmentation on the sentence to be segmented in the building field based on the Chinese word segmentation model before the BIM model dictionary and the building field term dictionary are embedded, to obtain an initial word segmentation result; optimize the initial word segmentation result based on the building field term dictionary in the Chinese word segmentation model after the two dictionaries are embedded, to obtain an optimized word segmentation result; and re-optimize the optimized word segmentation result based on the BIM model dictionary in the Chinese word segmentation model after the two dictionaries are embedded, to obtain a re-optimized word segmentation result.

Based on the above embodiments, the present embodiment further includes a training module, configured to train the chinese word segmentation model based on an HMM method or a CRF method.

On the basis of the above embodiments, the building module in this embodiment is further configured to: screen out the term phrases of the building field from the professional knowledge base of the building field; and construct the building field term dictionary according to the term phrases.

Fig. 8 illustrates a physical structure diagram of an electronic device, as shown in fig. 8, which may include: a processor 801, a communication interface (Communications Interface) 802, a memory 803, and a communication bus 804, wherein the processor 801, the communication interface 802, and the memory 803 communicate with each other through the communication bus 804. The processor 801 may call logic instructions in the memory 803 to perform the following method: constructing a BIM model dictionary according to the target BIM model; embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model; and word segmentation is carried out on the sentences to be segmented in the building field based on the Chinese word segmentation model embedded in the BIM model dictionary and the building field term dictionary.

Further, the logic instructions in the memory 803 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art or in a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The present embodiment provides a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments, for example, including: constructing a BIM model dictionary according to the target BIM model; embedding a pre-built building domain term dictionary and the BIM model dictionary into a pre-trained Chinese word segmentation model; and word segmentation is carried out on the sentences to be segmented in the building field based on the Chinese word segmentation model embedded in the BIM model dictionary and the building field term dictionary.

Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The self-adaptive Chinese word segmentation method for the building information model is characterized by comprising the following steps of:

Constructing a BIM model dictionary according to the target BIM model;

Word segmentation is carried out on sentences to be segmented in the building field based on a Chinese word segmentation model embedded in the BIM model dictionary and the building field term dictionary;

The target BIM model is a BIM model which needs to be subjected to data mining, the BIM model dictionary is constructed according to feature information of a building project actually used in the target BIM model, the feature information refers to words which refer to specific components or space areas in the target BIM model, and the BIM model dictionary comprises newly-appearing feature information in the building project;

The step of word segmentation for the sentence to be segmented in the building field based on the Chinese word segmentation model embedded in the BIM model dictionary and the building field term dictionary comprises the following steps:

2. The method for adaptively segmenting Chinese words in a building information model according to claim 1, wherein the step of constructing a BIM model dictionary from the target BIM model comprises:

3. The method of claim 1, wherein the step of embedding the pre-built building domain term dictionary and the BIM model dictionary in the pre-trained Chinese word segmentation model comprises:

4. The method for adaptively segmenting Chinese words in a building information model according to claim 3, wherein the step of embedding a pre-built dictionary of terms in the building domain and the dictionary of BIM models in a pre-trained Chinese word segmentation model based on a dictionary embedding algorithm of a double-array Trie comprises:

5. The method of adaptive Chinese word segmentation for a building information model according to any one of claims 1-4, further comprising, prior to the step of embedding the pre-built building domain term dictionary and the BIM model dictionary in a pre-trained Chinese word segmentation model:

6. The method of adaptive Chinese word segmentation for a building information model according to any one of claims 1-4, further comprising, prior to the step of embedding the pre-built building domain term dictionary and the BIM model dictionary in a pre-trained Chinese word segmentation model:

7. A building information model adaptive Chinese word segmentation device, characterized by comprising:

The word segmentation module is used for segmenting words of sentences to be segmented in the building field based on a Chinese word segmentation model embedded in the BIM model dictionary and the building field term dictionary;

The word segmentation module is specifically used for segmenting the sentences to be segmented in the building field based on the Chinese word segmentation model before the BIM model dictionary and the building field term dictionary are embedded, and obtaining an initial word segmentation result;

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the building information model adaptive Chinese word segmentation method according to any one of claims 1 to 6 when the program is executed.

9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the building information model adaptive Chinese word segmentation method according to any one of claims 1 to 6.