CN112632909B - English coding method and device for data object - Google Patents

Info

Publication number
CN112632909B
Authority
CN
China
Prior art keywords
word
data object
words
coding
english
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011191024.1A
Other languages
Chinese (zh)
Other versions
CN112632909A (en
Inventor
张冀兰
姚昊
杨加东
郭强
刘华
熊伟
富会佳
肖薇
杨沥铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CNNC Nuclear Power Operation Management Co Ltd
Original Assignee
CNNC Nuclear Power Operation Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CNNC Nuclear Power Operation Management Co Ltd filed Critical CNNC Nuclear Power Operation Management Co Ltd
Priority to CN202011191024.1A priority Critical patent/CN112632909B/en
Publication of CN112632909A publication Critical patent/CN112632909A/en
Application granted granted Critical
Publication of CN112632909B publication Critical patent/CN112632909B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure belongs to the technical field of nuclear power, and in particular relates to a method and device for English coding of data objects. The dictionary library is divided into a plurality of mutually disjoint word libraries, and the word library calling sequence for a data object is determined according to the category of the data object, so that the method adapts flexibly to the coding characteristics of different data objects, codes data objects more efficiently, is highly operable, and yields high coding quality. In addition, the knowledge of business field experts is abstracted into a knowledge base for English coding of nuclear power data objects, which turns that knowledge into shareable information and improves the quality of English coding; relying on the dictionary libraries and the English coding specification reduces dependence on business field experts and improves working efficiency; and the unified dictionary and standard English coding specification yield normalized English codes for data objects, eliminating cases in which one word has multiple translations.

Description

English coding method and device for data object
Technical Field
The invention belongs to the technical field of nuclear power, and particularly relates to a data object English coding method and device.
Background
With the internationalization of nuclear power group management and of China's nuclear power industry, a number of domestically developed nuclear power production management systems are under rapid development. In nuclear power informatization projects, the English codes of data objects in small and medium-sized information systems are usually written by business personnel of each project contractor, while those of large information systems are written by the construction party's business experts in the corresponding field. The coding methods used for different reactor types and different projects are not uniform, so the English codes of the same Chinese data object are inconsistent, which hinders data exchange between systems.
The 2010 edition of the English-Chinese nuclear power technical dictionary defines 62889 standard English entries and 3754 common nuclear power abbreviations for the nuclear power field. Using the dictionary directly in information system design runs into the following problems: (1) Entries are mostly phrases. Most entries in the English-Chinese nuclear power technical dictionary are phrases, and the Chinese names of many data objects cannot be mapped directly to them. (2) Inadequate vocabulary. The dictionary lacks entries in fields such as inventory, artificial intelligence, information technology, and file management. (3) Information system requirements are not met. The IT infrastructure on which an information system depends imposes specific requirements, such as limits on name length and on special characters. (4) English coding places high demands on personnel. An information system can have more than tens of thousands of data objects, and coding them from Chinese names to English names requires rich business background and cross-disciplinary skills; English naming therefore depends entirely on the experience of a very small number of business experts, which slows the construction of the information system. An efficient coding method is therefore needed.
Disclosure of Invention
In order to overcome the problems in the related art, a method and a device for English encoding of a data object are provided.
According to an aspect of the disclosed embodiments, there is provided a method for English coding of a data object, the method including:
acquiring a data object to be encoded;
determining the word stock calling sequence required for encoding the data object according to the category associated with the data object and the correspondence between categories and word stock calling sequences, wherein the contents of the word stocks do not overlap;
performing word segmentation on the data object to obtain a plurality of words;
and calling each word stock in turn according to the determined word stock calling sequence to encode the plurality of words, until all of the words are encoded and an encoding result is formed.
In one possible implementation, calling each word stock in turn according to the determined word stock calling sequence to encode the plurality of words, until all of the words are encoded and an encoding result is formed, includes:
each time a word stock is called according to the determined word stock calling sequence, if an uncoded word among the plurality of words matches the word stock called this time, encoding the matched word according to that word stock, until all of the words are encoded and an encoding result is formed.
In one possible implementation, the method further includes:
Determining the character length of the coding result;
judging whether the character length of the coding result meets a preset condition or not;
and, if the character length of the encoding result does not meet the preset condition, repeatedly replacing the longest word that is not yet abbreviated among the current plurality of words with its abbreviation from the called Chinese-English abbreviation comparison library to form a new encoding result, until the character length of the new encoding result meets the preset condition.
In one possible implementation, the method further includes:
if the category of the data object is judged to be table name, acquiring the business field associated with the data object;
in which case calling each word stock in turn according to the determined word stock calling sequence to encode the plurality of words, until all of the words are encoded and an encoding result is formed, includes:
calling each word stock in turn according to the determined word stock calling sequence to encode the plurality of words and the business field of the data object, until the words and the business field are all encoded and an encoding result is formed.
According to another aspect of the embodiments of the present disclosure, there is provided a data object english encoding apparatus, the apparatus including:
the first acquisition module is used for acquiring a data object to be coded;
the first determining module is used for determining a word stock calling sequence required in the process of encoding the data object according to the category associated with the data object and the corresponding relation between the category and the word stock calling sequence, and the contents of all word stocks are not repeated;
The word segmentation module is used for carrying out word segmentation processing on the data object to obtain a plurality of words;
And the coding module is used for sequentially calling each word stock to code the plurality of words according to the determined word stock calling sequence until the plurality of words are coded to form a coding result.
In one possible implementation, the encoding module includes:
a first coding sub-module, configured to, each time a word stock is called according to the determined word stock calling sequence, encode any uncoded word among the plurality of words that matches the word stock called this time according to that word stock, until all of the words are encoded and an encoding result is formed.
In one possible implementation, the apparatus further includes:
A second determining module, configured to determine a character length of the encoding result;
the judging module is used for judging whether the character length of the coding result accords with a preset condition;
and a reduction module, configured to, if the character length of the encoding result does not meet the preset condition, repeatedly replace the longest word that is not yet abbreviated among the current plurality of words with its abbreviation from the called Chinese-English abbreviation comparison library to form a new encoding result, until the character length of the new encoding result meets the preset condition.
In one possible implementation, the apparatus further includes:
a second acquisition module, configured to acquire the business field associated with the data object if the category of the data object is judged to be table name;
The encoding module includes:
a second coding sub-module, configured to call each word stock in turn according to the determined word stock calling sequence to encode the plurality of words and the business field of the data object, until the words and the business field are all encoded and an encoding result is formed.
According to another aspect of the embodiments of the present disclosure, there is provided a data object english encoding apparatus, the apparatus including:
A processor;
A memory for storing processor-executable instructions;
wherein the processor is configured to perform the above-described method.
According to another aspect of the disclosed embodiments, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
The beneficial effects of the present disclosure are as follows. The dictionary library is divided into a plurality of mutually disjoint word libraries, and the word library calling sequence for a data object is determined according to the category of the data object, so that the method adapts flexibly to the coding characteristics of different data objects, codes data objects more efficiently, is highly operable, and yields high coding quality. In addition, the knowledge of business field experts is abstracted into a knowledge base for English coding of nuclear power data objects, which turns that knowledge into shareable information and improves the quality of English coding; relying on the dictionary libraries and the English coding specification reduces dependence on business field experts and improves working efficiency; and the unified dictionary and standard English coding specification yield normalized English codes for data objects, eliminating cases in which one word has multiple translations.
Drawings
Fig. 1 is a flowchart illustrating a method of english encoding of a data object according to an exemplary embodiment.
Fig. 2 is a flowchart of an application example of a data object english encoding method.
Fig. 3 is a flowchart of an application example of a data object english encoding method.
Fig. 4 is a block diagram illustrating a data object english encoding apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a data object english encoding apparatus according to an exemplary embodiment.
Detailed Description
The invention will be described in further detail with reference to the accompanying drawings and specific examples.
Fig. 1 is a flowchart illustrating a method of english encoding of a data object according to an exemplary embodiment. The method may be performed by a terminal device, for example, the terminal device may be a server, a desktop computer, a notebook computer, a tablet computer, or the like, and the terminal device may also be, for example, a user device, a vehicle-mounted device, or a wearable device, or the like. As shown in fig. 1, the method may include:
Step 10, obtaining a data object to be encoded;
Step 11, determining the word stock calling sequence required for encoding the data object according to the category associated with the data object and the correspondence between categories and word stock calling sequences, wherein the contents of the word stocks do not overlap;
Step 12, performing word segmentation on the data object to obtain a plurality of words;
Step 13, calling each word stock in turn according to the determined word stock calling sequence to encode the plurality of words, until all of the words are encoded and an encoding result is formed.
In the present disclosure, a plurality of word stocks may be preset, and their contents do not overlap. The word stocks may be pre-stored in the terminal device, or in one or more other terminal devices with which the terminal device can establish a communication connection, so that the terminal device can call the word stocks in sequence when encoding is required. The plurality of word stocks may include: a business field word Chinese-English abbreviation comparison library, a specific word abbreviation library, a general field word Chinese-English abbreviation comparison library, a term library, a business field code library, and a common data attribute coding library.
In one possible implementation, the business field word Chinese-English abbreviation comparison library may be used to store Chinese and English words of the professional field, and need not store general Chinese and English words. Each mapping entry consists of a Chinese word, an English word, an English abbreviation, and a flag indicating whether the abbreviation is a standard abbreviation. A word may or may not have an English abbreviation. The data form of the library may be as shown in Table 1.
TABLE 1
Chinese word | English word | English abbreviation | Standard abbreviation?
Chemistry | Chemistry | chem | N
Analysis | Analysis | anls | Y
Effluent | emission | emis | Y
Sampling | Sampling | smpl | Y
The creation specification of the business field word Chinese-English abbreviation comparison library may include any one or more of the following:
1. Sub-libraries: one business field sub-library is created for each of the 98 business fields of nuclear power;
2. English abbreviation selection specification
(1) If the English-Chinese abbreviation dictionary contains an English abbreviation of the target word, that abbreviation is adopted.
(2) If the English-Chinese abbreviation dictionary does not contain an English abbreviation of the target word, a custom English abbreviation is formed by selecting the first letter corresponding to each consonant in the English word. A custom abbreviation does not exceed 5 English letters;
3. English word selection specification
(1) A more commonly used English word is preferred over a more precise one. For example, for the Chinese word for "procurement", buy is used instead of purchase; for the Chinese word for "gas", gas is used instead of vapor;
(2) A shorter noun or verb is preferred over an adjective. For example, for the Chinese word for "welding", weld is used rather than a longer derived form;
(3) If a Chinese name contains a word meaning form, list, or item, and the Chinese name is longer than 2 Chinese characters, that word is not encoded. For example, "application form" is encoded as "request" rather than "request sheet"; "analysis sheet" and "analysis item" are encoded as "analysis" rather than "analysis order";
In one possible implementation, the specific word abbreviation library may be used to store abbreviations of specific, high-frequency words that occur in data attributes. To shorten English codes and make data interaction more efficient, these specific words are given special codes. A specific word is a word that appears only at the beginning or the end of the Chinese name of a data attribute. The library is stored as triples of Chinese word, English translation, and English abbreviation. The data form of the specific word abbreviation library may be as shown in Table 2.
TABLE 2
Chinese word | English translation | English abbreviation
Code | Code | c
Number | Identification | id
Status | Status | s
Sequence number | Sequence | seq
Description | Description | des
Title | Title | ti
Date | Date | d
Time | Datetime | dt
Whether (i.e. Boolean type) | Boolean | b
The creation specification of the specific word abbreviation library may include any one or more of the following:
1. The total number of entries in the library is not more than 30;
2. An English abbreviation is preferably 1 letter long and at most 3 letters long;
3. An English abbreviation should be the initial letter of the English word or a combination of its letters;
4. English abbreviations must not be repeated.
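The creation rules for the specific word abbreviation library lend themselves to an automated check. The following is a minimal sketch, assuming the library is held in memory as a list of triples and assuming that rule 4 is read as requiring abbreviations to be unique; the names `SPECIFIC_WORDS`, `is_subsequence`, and `check_specific_word_library` are hypothetical and do not come from the patent.

```python
# Hypothetical in-memory form of the library:
# (Chinese word, English translation, English abbreviation).
# Chinese words are shown by their English glosses for readability.
SPECIFIC_WORDS = [
    ("code", "Code", "c"),
    ("number", "Identification", "id"),
    ("status", "Status", "s"),
    ("sequence number", "Sequence", "seq"),
    ("description", "Description", "des"),
    ("title", "Title", "ti"),
    ("date", "Date", "d"),
    ("time", "Datetime", "dt"),
    ("whether", "Boolean", "b"),
]


def is_subsequence(short, long):
    """True if the letters of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def check_specific_word_library(entries, max_entries=30, max_abbr_len=3):
    """Return a list of rule violations; an empty list means the library passes."""
    problems = []
    if len(entries) > max_entries:
        problems.append(f"{len(entries)} entries exceed the limit of {max_entries}")
    seen = set()
    for _chinese, english, abbr in entries:
        word = english.lower()
        # Rule 2: abbreviation length between 1 and 3 letters.
        if not 1 <= len(abbr) <= max_abbr_len:
            problems.append(f"{abbr!r}: length must be between 1 and {max_abbr_len}")
        # Rule 3 (assumed reading): starts with the initial and uses the word's letters in order.
        if not abbr or abbr[0] != word[:1] or not is_subsequence(abbr, word):
            problems.append(f"{abbr!r}: not built from the letters of {english!r}")
        # Rule 4 (assumed reading): no abbreviation may be used twice.
        if abbr in seen:
            problems.append(f"{abbr!r}: abbreviation is used more than once")
        seen.add(abbr)
    return problems
```

Run against the entries of Table 2, the check returns an empty list; adding a second entry abbreviated `s` would be reported as a duplicate.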
In one possible implementation, the general field word Chinese-English abbreviation comparison library may be used to store abbreviations of common English words that are not covered by the business field word Chinese-English abbreviation comparison library or the specific word abbreviation library; it stores abbreviations of single words, not of English phrases. The library is stored as pairs of English word and English word abbreviation, and its data form may be as shown in Table 3:
TABLE 3
English word | English word abbreviation
scale | scal
schema | schm
scope | scp
screen | scrn
The creation specification of the general field word Chinese-English abbreviation comparison library may include: based on the English-Chinese abbreviation dictionary, selecting the 2635 words, together with their abbreviations, that are used in nuclear power business data.
In one possible implementation, core terms and common, fixed Chinese phrases may be given English codes and collected into a term library. The term library is stored as triples of Chinese term, English translation, and English term abbreviation. An English term abbreviation consists of the first letter of each word of the English term. The term library is based on the appendix of common nuclear power abbreviations in the 2010 edition of the English-Chinese nuclear power technical dictionary and is extended according to business needs.
The data form of the term library may be as shown in Table 4.
TABLE 4
Chinese term | English translation | English term abbreviation
Corrective action | corrective action | ca
Quality defect report | quality deficiency report | qdr
Non-conformance item report | non-conformance report | ncr
Technical specification book | technical specification | ts
Technical specification | technical specification | ts
The term library creation specification may include any one or more of the following:
1. An English term abbreviation is not more than 4 letters long;
2. Chinese terms that have the same meaning but different names use the abbreviation of their corresponding English term (for example, both "technical specification" entries in Table 4 map to ts);
3. The repetition rate of English term abbreviations is not more than 2%, that is, at most 2 of every 100 English term abbreviations are identical;
4. When a Chinese phrase has the form number + adverb, the English term abbreviation is the number + the abbreviation of the adverb.
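As an illustration of how a term abbreviation could be derived and checked, the sketch below joins the first letter of each word of an English term and measures how often abbreviations collide. It makes two assumptions that the text does not state: hyphens are treated as word breaks (so that "non-conformance report" yields ncr, as in Table 4), and the repetition rate is taken as the share of abbreviations that collide with at least one other. The function names are hypothetical.

```python
import re
from collections import Counter


def term_abbreviation(english_term, max_len=4):
    """Join the first letter of each word; hyphens count as word breaks (assumption)."""
    words = [w for w in re.split(r"[\s-]+", english_term.lower()) if w]
    abbr = "".join(w[0] for w in words)
    if len(abbr) > max_len:
        raise ValueError(f"{abbr!r} is longer than {max_len} letters")
    return abbr


def repetition_rate(abbreviations):
    """Share of abbreviations that collide with at least one other abbreviation."""
    if not abbreviations:
        return 0.0
    counts = Counter(abbreviations)
    return sum(n for n in counts.values() if n > 1) / len(abbreviations)
```

A library builder might compute `repetition_rate` over the abbreviations of all distinct English terms and reject a new term if the rate would rise above 0.02.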
In one possible implementation, the business field code library may be used to store the Chinese names and English codes of all business fields of the nuclear power field. The library is stored as triples of Chinese name, English name, and English code. The data form of the business field code library may be as shown in Table 5:
TABLE 5
Chinese name | English name | English code
Test worksheet | Commissioning work order | cw
Debugging work package | Commissioning work package | cp
Radiation management | Radiation Management | rm
Work application | Work request | wr
Quality planning | quality plan | qp
Nuclear safety | nuclear safety | ns
Workflow configuration | Workflow configuration | wc
Chemical management | Chemical Management | cm
The business field code library creation specification may include any one or more of the following:
1. The length of a business field English code is fixed at 2 characters;
2. A business field English code consists of the first letters of the first 2 words of the English name;
3. Business field English codes must not be repeated. If the combination of the first letters of the first 2 words is already in use, the first letters of the 1st and 3rd words are used; if that is also in use, the first letter of the 1st word and the last letter of the 2nd word are used.
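The collision rule for business field codes can be sketched as a short function that tries the three candidate codes in order. This is an illustrative sketch only; `business_field_code` and the `used_codes` set are hypothetical names, and names with fewer than two words are simply rejected because the text does not cover them.

```python
def business_field_code(english_name, used_codes):
    """Return a 2-letter code for `english_name` and record it in `used_codes`."""
    words = english_name.lower().split()
    if len(words) < 2:
        raise ValueError("at least two words are needed to form a code")
    # Candidate 1: first letters of the first two words.
    candidates = [words[0][0] + words[1][0]]
    # Candidate 2: first letters of the 1st and 3rd words, if a 3rd word exists.
    if len(words) >= 3:
        candidates.append(words[0][0] + words[2][0])
    # Candidate 3: first letter of the 1st word and last letter of the 2nd word.
    candidates.append(words[0][0] + words[1][-1])
    for code in candidates:
        if code not in used_codes:
            used_codes.add(code)
            return code
    raise ValueError(f"no free code for {english_name!r}")


used = set()
names = ["Commissioning work order", "Commissioning work package",
         "Radiation Management", "Chemical Management"]
codes = [business_field_code(name, used) for name in names]
# codes == ["cw", "cp", "rm", "cm"]
```

Under these assumptions, "Commissioning work package" collides with cw and falls back to the first letters of its 1st and 3rd words, which reproduces the cw and cp entries of Table 5.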
In one possible implementation, the common data attribute coding library may store data attributes that occur commonly across data entities. It is established in order to unify the coding of the same data attribute in different data entities and to facilitate data exchange between different systems. The library is stored as pairs of Chinese name and English code. The data form of the common data attribute coding library may be as shown in Table 6:
TABLE 6
Chinese name | English code
Remarks | memo
Updating person | update_by
Update time | update_dt
Creator | create_by
Creation time | create_dt
The creation specification of the common data attribute coding library may include: 1. data attributes whose occurrence frequency exceeds 2% may be included in the coding library; 2. the English codes conform to the data object English coding specification of this disclosure.
As an example of this embodiment, in step 10, when encoding is required, the terminal device may obtain the data object to be encoded from one or more pre-stored data objects to be encoded, or from other devices or systems.
In step 11, each data object may comprise a pre-associated class, which may for example comprise a table name and a table attribute. Different classes may correspond to different lexicon invocation orders.
In step 12, the data object may be subjected to word segmentation by using a word segmentation technique to obtain a plurality of words, and it should be noted that any applicable word segmentation technique may be selected to perform word segmentation on the data object.
In step 13, when the terminal device calls the word stock each time according to the determined word stock call sequence, if the uncoded word in the plurality of words is matched with the word stock called for the time, the matched word is coded according to the word stock called for the time until the plurality of words are coded, and a coding result is formed.
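The lookup described for step 13 can be sketched as a loop over word stocks in the order chosen for the data object's category. The sketch below is a simplified illustration: the dictionaries `WORD_STOCKS` and `CALL_ORDER`, their keys, and their contents are hypothetical, Chinese words are represented by English glosses, and a word that matches no word stock is returned as `None`.

```python
# Hypothetical word stocks, keyed by name; contents do not overlap.
WORD_STOCKS = {
    "term": {"chemical sampling plan": "csp"},
    "business_field": {"seawater": "seawater"},
    "general_field": {"monitoring": "monitor", "analysis item": "analysis"},
}

# Hypothetical correspondence between data object category and calling sequence.
CALL_ORDER = {
    "table_name": ["term", "business_field", "general_field"],
}


def encode_words(words, category):
    """Encode words by calling each word stock in the order set by the category."""
    encoded = {}
    for stock_name in CALL_ORDER[category]:
        stock = WORD_STOCKS[stock_name]
        for word in words:
            # Only words not yet encoded are looked up in the current stock.
            if word not in encoded and word in stock:
                encoded[word] = stock[word]
        if all(word in encoded for word in words):
            break  # every word is encoded; later stocks are not called
    return [encoded.get(word) for word in words]


result = encode_words(
    ["chemical sampling plan", "seawater", "monitoring", "analysis item"],
    "table_name",
)
# result == ["csp", "seawater", "monitor", "analysis"]
```

Because the word stocks are disjoint, the calling sequence decides only how early a word is resolved, not which code it receives.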
In one possible implementation, the data object may also be associated in advance with a business field (for example, the chemical field), and the terminal device may acquire the business field associated with the data object if it determines that the category of the data object is table name;
in that case, calling each word stock in turn according to the determined word stock calling sequence to encode the plurality of words, until all of the words are encoded and an encoding result is formed, includes:
calling each word stock in turn according to the determined word stock calling sequence to encode the plurality of words and the business field of the data object, until the words and the business field are all encoded and an encoding result is formed.
In one possible implementation, after forming the encoding result, the terminal device may also determine the character length of the encoding result and judge whether it meets a preset condition. If the character length does not meet the preset condition, the terminal device repeatedly replaces the longest word that is not yet abbreviated among the current plurality of words with its abbreviation from the called Chinese-English abbreviation comparison library to form a new encoding result, until the character length of the new encoding result meets the preset condition.
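The length reduction loop can be illustrated as follows. This is a sketch under stated assumptions: the preset condition is taken to be a maximum length of 30 characters, parts are joined by underscores, a word with no abbreviation in the library is skipped, and the loop stops if nothing more can be abbreviated. The function name `shorten` and the `abbreviations` mapping are hypothetical.

```python
def shorten(parts, abbreviations, max_len=30, sep="_"):
    """Abbreviate the longest not-yet-abbreviated word until the code fits."""
    parts = list(parts)
    abbreviated = [False] * len(parts)
    while len(sep.join(parts)) > max_len:
        # Words that are still in full form and have an abbreviation available.
        candidates = [
            i for i, done in enumerate(abbreviated)
            if not done and parts[i] in abbreviations
        ]
        if not candidates:
            break  # nothing left to abbreviate; the caller must handle this case
        longest = max(candidates, key=lambda i: len(parts[i]))
        parts[longest] = abbreviations[parts[longest]]
        abbreviated[longest] = True
    return sep.join(parts)


code = shorten(["cm", "csp", "seawater", "monitor", "analysis"],
               {"analysis": "anls"})
# code == "cm_csp_seawater_monitor_anls"
```

When the loop ends without meeting the limit, a real system would need a further step, such as creating new abbreviations, before accepting the code.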
In one possible implementation, the method may further include:
if the category of the data object is judged to be table name, acquiring the business field associated with the data object;
Step 13 may further include: calling each word stock in turn according to the determined word stock calling sequence to encode the plurality of words and the business field of the data object, until the words and the business field are all encoded and an encoding result is formed.
In one possible implementation, the terminal device may process the obtained codes according to a data object English coding specification to obtain the encoding result. The data object English coding specification may include the following items:
1. A code consists of Arabic numerals, English letters, and underscores, and its first and last characters must be letters;
2. Lowercase English letters are used;
3. Words are connected by underscores;
4. The encoding result does not exceed 30 characters;
5. When a data entity is encoded, full English words are combined to form the English code if rules 3 and 4 can be satisfied; otherwise, English abbreviations are combined to form the English code;
6. The English code of a data entity is composed as: business field code + English code;
7. For English coding of data attributes: some data attributes (Table 2) occur in many data entities, and such a data attribute is encoded as the data entity code (without the business field code) + the English abbreviation from Table 2, connected by an underscore.
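Several rules of the coding specification can be checked or applied mechanically. The sketch below assumes that rule 1 requires both the first and the last character to be a letter; the names `CODE_PATTERN`, `is_valid_code`, `entity_code`, and `attribute_code` are hypothetical and serve only to illustrate rules 1 to 4, 6, and 7.

```python
import re

# Lowercase letters, digits, and underscores; first and last characters are letters.
CODE_PATTERN = re.compile(r"[a-z](?:[a-z0-9_]*[a-z])?")


def is_valid_code(code, max_len=30):
    """Check rules 1, 2, and 4: character set, lowercase, and maximum length."""
    return len(code) <= max_len and CODE_PATTERN.fullmatch(code) is not None


def entity_code(field_code, english_parts):
    """Rule 6: business field code + English code, joined by underscores (rule 3)."""
    return "_".join([field_code, *english_parts])


def attribute_code(entity_parts, specific_abbr):
    """Rule 7: entity code without the business field code + specific word abbreviation."""
    return "_".join([*entity_parts, specific_abbr])


table = entity_code("cm", ["csp", "seawater", "monitor", "anls"])
column = attribute_code(["material"], "c")
# table == "cm_csp_seawater_monitor_anls", column == "material_c"
```

A validator of this kind could run as the last step before a generated code is stored, so that codes violating the character or length rules are flagged rather than silently accepted.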
Fig. 2 is a flowchart of an application example of a data object English encoding method. As shown in Fig. 2, the method may include:
S101: Start. A data object is obtained; its category is table name, and its Chinese name is "chemical sampling plan-seawater monitoring analysis item".
S102: Obtain the business field code according to the business field to which the data object belongs and the business field code library. Each data object belongs to exactly one business field. This data object belongs to the chemical management business field, and the business field code cm is obtained from the business field code library;
S103: Check the data object against the term library; if the data object contains a term, encode the term directly. This data object contains the term "chemical sampling plan", which yields the code csp.
S104: Segment the Chinese name of the data object. Chinese word segmentation of the remaining Chinese part of the data object yields the Chinese words: seawater, monitoring, analysis item.
S105: Map and encode the Chinese words according to the relevant sub-library (here, chemical management) of the business field word Chinese-English abbreviation comparison library. The data object belongs to the chemical management field, so the Chinese words from S104 are each looked up in the chemical management sub-library, yielding the English word seawater for "seawater".
S106: Map the Chinese words according to the general field word Chinese-English abbreviation comparison library. The Chinese words left unencoded after S105 are encoded, yielding the English word "monitor" for "monitoring" and the English word "analysis" for "analysis item".
S107: Generate the code according to rule 3 of the data object English coding specification, obtaining cm_csp_seawater_monitor_analysis.
S108: Check the code length. If the code is too long, English words are replaced one by one with their abbreviations from the general field word Chinese-English abbreviation comparison library and the business field word Chinese-English abbreviation comparison library, and the code length is checked again after each replacement. Here the code length is 33 characters, which exceeds the specification. In the two comparison libraries, no abbreviations are found for seawater or monitor, but the abbreviation anls is found for analysis. After replacement, the new code "cm_csp_seawater_monitor_anls" is obtained. Its length is checked again and is 29 characters, which conforms to the data object English coding specification.
S109: Check the length of the code output by S108 according to rule 4 of the data object English coding specification. If the code is still too long, create abbreviations for the non-abbreviated English words according to the Chinese-English comparison dictionary library and the creation specification, and add them to the general field word Chinese-English abbreviation comparison library and the business field word Chinese-English abbreviation comparison library.
Fig. 3 is a flowchart of an application example of the data object English encoding method. As shown in Fig. 3, the method may include:
S201: Check the data attributes against the common data attribute coding library; if a data attribute belongs to the library, encode it directly. In this embodiment, the data attribute "creator" belongs to the library, and its English code is "create_by";
S202: Check the data attributes against the term library; if a data attribute is a term in the library, encode it directly. In this embodiment, the data attribute "parameter unit" belongs to the library, and its English code is "uom";
S203: Perform word segmentation on the Chinese names of the data attributes. In this embodiment, the result of Chinese word segmentation is as follows:
Material code (material, code)
Parameter unit -> uom
Chemical class (chemical, class)
Chemical technical approval number (chemical, technical, approval, number)
Creator -> create_by
S204: Map and encode the Chinese words according to rule 7 of the data object English coding specification and the "4. specific word abbreviation library". In this embodiment, the words "code" and "number" belong to the library, and the result is as follows:
Material code (material, code -> c)
Parameter unit -> uom
Chemical class (chemical, class)
Chemical technical approval number (chemical, technical, approval, number -> id)
Creator -> create_by
S206: Map and encode the Chinese words according to the relevant sub-library of the "business-domain word Chinese-English abbreviation comparison library". In this embodiment, the Chinese word "chemical" is encoded using the "chemical management" sub-library, and the result is as follows:
Material code (material, code -> c)
Parameter unit -> uom
Chemical class (chemical -> chemical, class)
Chemical technical approval number (chemical -> chemical, technical, approval, number -> id)
Creator -> create_by
S207: Map and encode the Chinese words according to the "4. general-domain word Chinese-English abbreviation comparison library". In this embodiment, the Chinese words "material", "class", "technical" and "approval" are encoded, and the result is as follows:
Material code (material -> material, code -> c)
Parameter unit -> uom
Chemical class (chemical -> chemical, class -> type)
Chemical technical approval number (chemical -> chemical, technical -> technology, approval -> examine, number -> id)
Creator -> create_by
S208: Generate the codes according to rule 3 of the data object English coding specification. In this embodiment, the result after encoding is as follows:
Material code -> material_c
Parameter unit -> uom
Chemical class -> chemical_type
Chemical technical approval number -> chemical_technology_examine_id
Creator -> create_by
S209: Check the code length according to rule 4 of the data object English coding specification. If a code is too long, replace its English words with abbreviations one at a time, using the "general-domain word Chinese-English abbreviation comparison library" and the "business-domain word Chinese-English abbreviation comparison library", and recheck the length after each replacement. In this embodiment, the code length of "chemical_technology_examine_id" is 31, which exceeds the 30 characters allowed by the rule. The result after re-encoding its first word according to the comparison library is as follows:
Material code -> material_c
Parameter unit -> uom
Chemical class -> chemical_type
Chemical technical approval number -> chem_technology_examine_id
Creator -> create_by
S210: Check the length of the code output by S209 according to rule 4 of the data object English coding specification. If it is still too long, create abbreviations for the unabbreviated English words as specified in clause 2 of the "English abbreviation selection specification". In this embodiment, this situation does not arise.
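The staged lookup of S201 to S208 can be sketched in Python as follows. This is only an illustration under stated assumptions: the library contents are hypothetical excerpts keyed by English glosses (the translated text does not show the original Chinese entries), and the names `COMMON_ATTRIBUTES`, `TERMS`, `WORD_LIBRARIES` and `encode_attribute` are invented for the example.

```python
# Hypothetical library excerpts, keyed by English glosses of the Chinese words.
COMMON_ATTRIBUTES = {"creator": "create_by"}           # S201: whole-attribute lookup
TERMS = {"parameter unit": "uom"}                      # S202: whole-attribute lookup
WORD_LIBRARIES = [                                     # called in this order
    {"code": "c", "number": "id"},                     # S204: specific word abbreviations
    {"chemical": "chemical"},                          # S206: business-domain sub-library
    {"material": "material", "class": "type",
     "technical": "technology", "approval": "examine"},  # S207: general-domain library
]


def encode_attribute(name, words):
    """Return the English code of one data attribute, or None if a word stays unmapped."""
    for library in (COMMON_ATTRIBUTES, TERMS):
        if name in library:
            return library[name]  # the whole attribute is known, so encode it directly
    encoded = {}
    for library in WORD_LIBRARIES:
        for word in words:
            if word not in encoded and word in library:
                encoded[word] = library[word]  # only still-unencoded words are mapped
    if len(encoded) < len(set(words)):
        return None
    return "_".join(encoded[word] for word in words)  # S208: join in word order


attributes = {
    "material code": ["material", "code"],
    "parameter unit": ["parameter", "unit"],
    "chemical class": ["chemical", "class"],
    "chemical technical approval number": ["chemical", "technical", "approval", "number"],
    "creator": ["creator"],
}
for name, words in attributes.items():
    print(name, "->", encode_attribute(name, words))
```

Because a word is skipped once it has been encoded, an earlier library takes precedence over a later one, which is why the calling order matters.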
Fig. 4 is a block diagram of a data object English encoding apparatus according to an exemplary embodiment. As shown in Fig. 4, the apparatus may include:
A first acquisition module 40, configured to acquire a data object to be encoded;
A first determining module 41, configured to determine the lexicon calling order required for encoding the data object, according to the category associated with the data object and the correspondence between categories and lexicon calling orders;
A word segmentation module 42, configured to perform word segmentation on the data object to obtain a plurality of words;
An encoding module 43, configured to call each lexicon in turn, in the determined lexicon calling order, to encode the plurality of words until all of the plurality of words are encoded, forming an encoding result.
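The roles of the first determining module and the word segmentation module can be sketched as below. The text does not state which calling orders or which segmentation algorithm are used, so everything here is an assumption: the category names and orders in `CALL_ORDER_BY_CATEGORY` are hypothetical (the "data_attribute" order simply mirrors steps S201 to S207), forward maximum matching is swapped in as a simple stand-in segmenter, and the Chinese strings are assumed spellings of attribute names from the Fig. 3 example.

```python
# Hypothetical category-to-order table; the text only says that different
# categories correspond to different lexicon calling orders.
CALL_ORDER_BY_CATEGORY = {
    "table_name": ["business_domain", "general_domain"],
    "data_attribute": ["common_attribute", "term", "specific_word",
                       "business_domain", "general_domain"],
}


def lexicon_call_order(category):
    """Look up the lexicon calling order for the category of a data object."""
    return CALL_ORDER_BY_CATEGORY[category]


def segment(text, vocabulary):
    """Forward maximum matching: at each position take the longest vocabulary word."""
    longest = max(map(len, vocabulary), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)  # unknown single characters pass through unchanged
                i += size
                break
    return words


# Assumed Chinese vocabulary: material, code, chemical, technical, approval, number.
VOCABULARY = {"物料", "编码", "化学品", "技术", "审批", "编号"}
print(lexicon_call_order("data_attribute"))
print(segment("物料编码", VOCABULARY))
print(segment("化学品技术审批编号", VOCABULARY))
```

A production system would more likely rely on a dedicated Chinese word segmenter; the matching loop above only shows where segmentation sits between determining the order and encoding.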
In one possible implementation, the encoding module includes:
A first encoding sub-module, configured to, each time a lexicon is called in the determined lexicon calling order, encode any not-yet-encoded word among the plurality of words that matches the lexicon called this time, according to that lexicon, until all of the plurality of words are encoded, forming an encoding result.
In one possible implementation, the apparatus further includes:
A second determining module, configured to determine the character length of the encoding result;
A judging module, configured to judge whether the character length of the encoding result meets a preset condition;
A reduction module, configured to, when the character length of the encoding result does not meet the preset condition, repeatedly replace the unabbreviated word that currently has the most characters among the plurality of words with its abbreviation from the called Chinese-English abbreviation comparison library, forming a new encoding result, until the character length of the new encoding result meets the preset condition.
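The behaviour of the reduction module can be sketched as a loop that abbreviates the longest remaining word first. This is a sketch under assumptions, not the patented implementation: the function name `reduce_longest_first`, the abbreviation "tech" and the limits used below are hypothetical, and words without an entry in the abbreviation library are simply skipped.

```python
def reduce_longest_first(words, abbreviations, max_len):
    """Abbreviate the longest still-unabbreviated word until the joined code fits."""
    words = list(words)
    abbreviated = set()  # positions that have already been replaced
    while len("_".join(words)) > max_len:
        candidates = [i for i, word in enumerate(words)
                      if i not in abbreviated and word in abbreviations]
        if not candidates:
            break  # nothing left to abbreviate; the caller must handle the overlong code
        longest = max(candidates, key=lambda i: len(words[i]))
        words[longest] = abbreviations[words[longest]]
        abbreviated.add(longest)
    return "_".join(words)


# Hypothetical abbreviation entries; "tech" is not taken from the text.
abbreviations = {"chemical": "chem", "technology": "tech"}
words = ["chemical", "technology", "examine", "id"]
print(reduce_longest_first(words, abbreviations, max_len=25))
print(reduce_longest_first(words, abbreviations, max_len=22))
```

Unlike the Fig. 3 walkthrough, which re-encodes the first word, this loop follows the module's rule of replacing the word with the most characters first.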
In one possible implementation, the apparatus further includes:
A second acquisition module, configured to acquire the fields associated with the data object when the category of the data object is determined to be a table name;
The encoding module includes:
A second encoding sub-module, configured to call each lexicon in turn, in the determined lexicon calling order, to encode the plurality of words and the fields of the data object until the plurality of words and the fields of the data object are all encoded, forming an encoding result.
It should be noted that the apparatus has already been described in detail in the description of the method above, so the details are not repeated here.
Fig. 5 is a block diagram of a data object English encoding apparatus according to an exemplary embodiment. For example, the apparatus 1900 may be provided as a server. Referring to Fig. 5, the apparatus 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the method described above.
The apparatus 1900 may further include a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber optic cable), or electrical signals transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by using state information of the computer readable program instructions to personalize electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), and the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement of the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A data object English encoding method, the method comprising:
acquiring a data object to be encoded;
determining the lexicon calling order required for encoding the data object according to the category associated with the data object and the correspondence between categories and lexicon calling orders, wherein the contents of the lexicons do not overlap;
performing word segmentation on the data object to obtain a plurality of words;
calling each lexicon in turn, in the determined lexicon calling order, to encode the plurality of words until all of the plurality of words are encoded, forming an encoding result;
wherein different categories correspond to different lexicon calling orders, and the lexicons comprise: a business-domain word Chinese-English abbreviation comparison library, a general-domain word Chinese-English abbreviation comparison library, a specific word abbreviation library, a term library and a common data attribute coding library, wherein the common data attribute coding library is used to store data attributes whose frequency of occurrence is greater than 2% and whose encoding conforms to preset encoding rules.
2. The method of claim 1, wherein calling each lexicon in turn, in the determined lexicon calling order, to encode the plurality of words until all of the plurality of words are encoded, forming an encoding result, comprises:
each time a lexicon is called in the determined lexicon calling order, if a not-yet-encoded word among the plurality of words matches the lexicon called this time, encoding the matched word according to that lexicon, until all of the plurality of words are encoded, forming an encoding result.
3. The method according to claim 1, wherein the method further comprises:
determining the character length of the encoding result;
judging whether the character length of the encoding result meets a preset condition;
when the character length of the encoding result does not meet the preset condition, repeatedly replacing the unabbreviated word that currently has the most characters among the plurality of words with its abbreviation from the called Chinese-English abbreviation comparison library, forming a new encoding result, until the character length of the new encoding result meets the preset condition.
4. The method according to claim 1, wherein the method further comprises:
when the category of the data object is determined to be a table name, acquiring the fields associated with the data object;
wherein calling each lexicon in turn, in the determined lexicon calling order, to encode the plurality of words until all of the plurality of words are encoded, forming an encoding result, comprises:
calling each lexicon in turn, in the determined lexicon calling order, to encode the plurality of words and the fields of the data object until the plurality of words and the fields of the data object are all encoded, forming an encoding result.
5. A data object English encoding apparatus, the apparatus comprising:
a first acquisition module, configured to acquire a data object to be encoded;
a first determining module, configured to determine the lexicon calling order required for encoding the data object according to the category associated with the data object and the correspondence between categories and lexicon calling orders, wherein the contents of the lexicons do not overlap;
a word segmentation module, configured to perform word segmentation on the data object to obtain a plurality of words;
an encoding module, configured to call each lexicon in turn, in the determined lexicon calling order, to encode the plurality of words until all of the plurality of words are encoded, forming an encoding result;
wherein different categories correspond to different lexicon calling orders, and the lexicons comprise: a business-domain word Chinese-English abbreviation comparison library, a general-domain word Chinese-English abbreviation comparison library, a specific word abbreviation library, a term library and a common data attribute coding library, wherein the common data attribute coding library is used to store data attributes whose frequency of occurrence is greater than 2% and whose encoding conforms to preset encoding rules.
6. The apparatus of claim 5, wherein the encoding module comprises:
a first encoding sub-module, configured to, each time a lexicon is called in the determined lexicon calling order, encode any not-yet-encoded word among the plurality of words that matches the lexicon called this time, according to that lexicon, until all of the plurality of words are encoded, forming an encoding result.
7. The apparatus of claim 5, wherein the apparatus further comprises:
a second determining module, configured to determine the character length of the encoding result;
a judging module, configured to judge whether the character length of the encoding result meets a preset condition;
a reduction module, configured to, when the character length of the encoding result does not meet the preset condition, repeatedly replace the unabbreviated word that currently has the most characters among the plurality of words with its abbreviation from the called Chinese-English abbreviation comparison library, forming a new encoding result, until the character length of the new encoding result meets the preset condition.
8. The apparatus of claim 5, wherein the apparatus further comprises:
a second acquisition module, configured to acquire the fields associated with the data object when the category of the data object is determined to be a table name;
wherein the encoding module comprises:
a second encoding sub-module, configured to call each lexicon in turn, in the determined lexicon calling order, to encode the plurality of words and the fields of the data object until the plurality of words and the fields of the data object are all encoded, forming an encoding result.
9. A data object english encoding apparatus, the apparatus comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to perform the method of any one of claims 1 to 4.
10. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 4.
CN202011191024.1A 2020-10-30 2020-10-30 English coding method and device for data object Active CN112632909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011191024.1A CN112632909B (en) 2020-10-30 2020-10-30 English coding method and device for data object

Publications (2)

Publication Number Publication Date
CN112632909A CN112632909A (en) 2021-04-09
CN112632909B true CN112632909B (en) 2024-06-11

Family

ID=75303191

Country Status (1)

Country Link
CN (1) CN112632909B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884093B (en) * 2021-04-29 2021-08-31 四川大学 Rotary machine fault diagnosis method and equipment based on DSCRN model and storage medium
CN113536737A (en) * 2021-07-19 2021-10-22 北京数码大方科技股份有限公司 Material code generation method and device and electronic equipment
CN113946660B (en) * 2021-12-21 2022-03-15 卡斯柯信号(北京)有限公司 Automatic Boolean variable checking method and system for offline data of train control center
CN115098476A (en) * 2022-06-23 2022-09-23 中核核电运行管理有限公司 Data cleaning method and device for integrating production data of nuclear power station with multiple sources

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002007104A (en) * 2000-06-22 2002-01-11 Mitsubishi Electric Corp Character data compressing and displaying device
US7610192B1 (en) * 2006-03-22 2009-10-27 Patrick William Jamieson Process and system for high precision coding of free text documents against a standard lexicon
CN102508824A (en) * 2011-09-29 2012-06-20 苏州大学 Compression coding and decoding method and device for microblog information
CN102880703A (en) * 2012-09-25 2013-01-16 广州市动景计算机科技有限公司 Methods and systems for encoding and decoding Chinese webpage data
CN103646017A (en) * 2013-12-11 2014-03-19 南京大学 Acronym generating system for naming and working method thereof
CN105069124A (en) * 2015-08-13 2015-11-18 易保互联医疗信息科技(北京)有限公司 Automatic ICD (International Classification of Diseases) coding method and system
CN108108365A (en) * 2016-11-25 2018-06-01 核工业北京地质研究院 A kind of classification and coding method suitable for Nuclear waste disposal multi-source information management
CN110362542A (en) * 2019-07-15 2019-10-22 岭澳核电有限公司 Nuclear power station document No. method, apparatus, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011075762A1 (en) * 2009-12-22 2011-06-30 Health Ewords Pty Ltd Method and system for classification of clinical information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
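Semantic control and standardization of abbreviations in the interoperability of knowledge organization systems; Deng Panpan; Sun Haixia; Chinese Journal of Medical Library and Information Science; 2020-01-15 (No. 01); 16-25 *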

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant