CN115204190A - Device and method for converting financial field terms into English

Info

Publication number: CN115204190A
Application number: CN202211107345.8A
Authority: CN
Inventors: 韩双江; 姜长江; 苏雯斐; 王辉; 黄荣辉; 战启铭
Original assignee: Sino Credit Information Technology Beijing Co ltd
Current assignee: Sino Credit Information Technology Beijing Co ltd
Priority date: 2022-09-13
Filing date: 2022-09-13
Publication date: 2022-10-18
Anticipated expiration: 2042-09-13
Also published as: CN115204190B

Abstract

The invention discloses a device for converting financial field terms into English, comprising: a term bank preprocessing module for loading a term bank in which financial field terms are preset; a Chinese semantic analysis word segmentation module for acquiring a sentence to be processed and splitting it, in sentence order, into a plurality of words and/or phrases according to the financial field terms it contains; a Chinese-to-English conversion module for converting the split words and/or phrases into English terms; and a result output module for outputting the sentence to be processed and the corresponding English terms. The device analyzes financial field semantics accurately and converts them accurately into English terms. The invention also provides a method for converting financial field terms into English, with the same accurate semantic analysis and accurate English term conversion.

Description

Device and method for converting financial field terms into English

Technical Field

The invention relates to the technical field of semantic analysis of terms in the financial field. More particularly, the present invention relates to a device and method for converting financial domain terminology into English.

Background

Existing tools in the industry include Jieba word segmentation, the HanLP Chinese language processing package, jcseg (a lightweight Java Chinese word segmenter), sego Chinese word segmentation, foolNLTK Chinese word segmentation, ansj Chinese word segmentation, the "word" segmenter and the like. Most of these split text according to general Chinese semantics, so the resulting word groups are mostly compound general-purpose phrases that do not meet the needs of splitting and using financial terms; moreover, these tools only segment words and do not convert Chinese terms into English. The industry also has a tool developed by Teradata staff on top of Microsoft Excel. It is convenient and supports both financial term splitting and Chinese-to-English conversion, but it is written in VB and runs inside Microsoft Excel, which raises licensing issues in most scenarios and cannot be installed on many systems.

Disclosure of Invention

An object of the present invention is to solve at least the above problems and to provide at least the advantages described later.

To achieve these objects and other advantages in accordance with the purpose of the invention, there is provided a device for converting financial field terms into English, comprising:

the term bank preprocessing module is used for loading a term bank, and financial field terms are preset in the term bank;

the Chinese semantic analysis word segmentation module is used for acquiring a sentence to be processed and splitting the sentence, in sentence order, into a plurality of words and/or phrases according to the financial field terms it contains;

the Chinese-to-English conversion module is used for converting the split words and/or phrases into corresponding English terms;

the method by which the Chinese semantic analysis word segmentation module splits the sentence comprises the following steps:

step one, splitting the sentence character by character, counting the number n of Chinese characters, forming an ordered set An = (A1, A2, ..., Am, ..., An) from the split characters in sentence order, and setting n as the number of iteration cycles;

step two, semantic analysis: taking the first element A1 of the ordered set An and comparing it with the term bank; if it hits, recording the element and mapping the corresponding English term; then cumulatively combining the element with the remaining elements of An, one at a time, to form new candidate words, comparing each candidate with the term bank, and collecting the hit element and hit candidates into an ordered set Bn; iterating in this way until A1 has been cumulatively combined with all of the remaining n-1 elements;

step three, reading the longest minimum-unit semantic word in the set Bn as the first split term, whose length is m; starting the 2nd iteration from element A(m+1), reading the longest minimum-unit semantic word from the 2nd-round set Bn as the second hit split term and mapping the corresponding English term; repeating until m = n, that is, until the whole sentence has been consumed, at which point semantic splitting is complete and the split vocabulary is formed;

and the result output module is used for outputting the sentences to be processed and the corresponding English terms.

Preferably, the term bank comprises a built-in term bank and a custom term bank; the built-in term bank contains a plurality of financial field terms, the custom term bank is used for adding new terms, and the custom term bank has higher priority than the built-in term bank.

Preferably, if no candidate hits the term bank during the comparison in step two and step three, the longest word accumulated from the combinations starting at element A1 is taken as the split word, and the mapped English term is represented by a placeholder.

Preferably, the device further comprises a target semantic integration module for joining adjacent split words and/or phrases with a preset symbol and for joining the corresponding adjacent English terms with a preset symbol.

Preferably, the target semantic integration module presents the English terms corresponding to the sentence in camel case, all upper case and all lower case.

Preferably, the result output module outputs the result either to the console or to an Excel file.

A method for converting financial field terms into English is also provided, comprising the following steps:

step one, acquiring a sentence to be processed and splitting it, in sentence order, into a plurality of words and/or phrases according to the financial field terms it contains, wherein a term bank containing a plurality of financial field terms is preset;

the method for splitting the sentence comprises the following steps:

step a, splitting the sentence character by character, counting the number n of Chinese characters, forming an ordered set An = (A1, A2, ..., Am, ..., An) from the split characters in sentence order, and setting n as the number of iteration cycles;

step b, semantic analysis: taking the first element A1 of the ordered set An and comparing it with the term bank; if it hits, recording the element and mapping the corresponding English term; then cumulatively combining the element with the remaining elements of An, one at a time, to form new candidate words, comparing each candidate with the term bank, and collecting the hit element and hit candidates into an ordered set Bn; iterating in this way until A1 has been cumulatively combined with all of the remaining n-1 elements;

step c, reading the longest minimum-unit semantic word in the set Bn as the first split term, whose length is m; starting the 2nd iteration from element A(m+1), reading the longest minimum-unit semantic word from the 2nd-round set Bn as the second hit split term and mapping the corresponding English term; repeating until m = n, that is, until the whole sentence has been consumed, at which point the loop ends, semantic splitting is complete and the split vocabulary is formed;

step two, converting the split words and/or phrases into English terms;

and step three, outputting the sentences to be processed and the corresponding English terms.

The invention has at least the following beneficial effects: the word segmentation algorithm of the device was designed entirely in-house and meets the requirement of term splitting in the financial field. The device can split words through Chinese semantic analysis and can also convert Chinese terms into English terms based on the built-in term bank and the custom term bank, and it outputs the conversion result in three forms (camel case, all upper case and all lower case) in one step, which is convenient to use and requires no secondary processing.

The device can run on Windows as well as on Linux or Unix operating systems, so its environment requirements are low.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.

Drawings

FIG. 1 is a block diagram of the apparatus according to one embodiment of the present invention;

FIG. 2 is a flow chart of semantic splitting according to one embodiment of the present invention;

FIG. 3 is an example of the analysis performed by the Chinese semantic analysis word segmentation module of the present invention;

FIG. 4 is a screenshot of a user interface of one of the embodiments of the present invention;

FIG. 5 is an example of Excel output according to one embodiment of the present invention.

Detailed Description

The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.

It is to be understood that the terminology indicating the positions or positional relationships is based on the positions or positional relationships shown in the drawings, and is for the purpose of convenience in describing the invention and simplifying the description, and does not indicate or imply that the device or element being referred to must have a particular orientation, configuration and operation in a particular orientation, and therefore, should not be taken as limiting the invention.

As shown in FIG. 1 to FIG. 5, the present invention provides a device for converting financial field terms into English, comprising:

the term bank preprocessing module, which is used for loading a term bank in which financial field terms are preset. The module has two term processing mechanisms built in: the built-in term bank contains common financial field terms compiled in the industry in recent years, and the custom term bank can modify terms of the built-in term bank or add new terms as needed. After the device starts, the term bank preprocessing module loads the built-in term bank first and then the custom term bank. If the custom term bank contains a term that also exists in the built-in term bank, the custom entry overrides the built-in one. The override is effective only for the current run of the device and does not permanently overwrite the built-in term bank.

The Chinese semantic analysis word segmentation module is used for acquiring a sentence to be processed and splitting it, in sentence order, into a plurality of words and/or phrases according to the financial field terms it contains. The module starts after the term bank preprocessing module has finished loading and then receives the sentences to be processed. Once a sentence has been received, the module works through it in order, matching the longest financial field terms in the term bank, and splits the sentence at those terms; words that miss are uniformly represented by placeholders. This completes the splitting of the sentence according to financial field terms.

The Chinese-to-English conversion module is used for converting the split words and/or phrases into English terms, that is, converting the split vocabulary of the sentence into the corresponding English terms.

The result output module is used for outputting the sentence to be processed and the corresponding English terms. A single sentence is output to the console (standard output), and batches of sentences are output as an Excel file.

In the above technical solution, the device performs two main functions: first, splitting Chinese sentences into words through semantic analysis; second, converting the split Chinese words into English terms. To realize these two functions, the device comprises four processing modules: the term bank preprocessing module, the Chinese semantic analysis word segmentation module, the Chinese-to-English conversion module and the result output module. The device is designed and developed in Java. Its deployment environment, the JDK, is open source and raises no licensing issues, and because Java is compile-once, run-anywhere, the device can support the runtime environments available on the market: it can run on Windows as well as on Linux or Unix operating systems, so its environment requirements are low. The word segmentation algorithm of the device was designed entirely in-house and meets the requirement of term splitting in the financial field. The device can split words through Chinese semantic analysis and can also convert Chinese terms into English terms based on the built-in term bank and the custom term bank, and it outputs the conversion result in three forms (camel case, all upper case and all lower case) in one step, which is convenient to use and requires no secondary processing.
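The loading order described for the term bank preprocessing module (built-in terms first, then custom terms, with custom entries taking precedence for the current run only) could be sketched roughly as follows. This is a minimal illustration under assumptions, not the patented implementation: the class name `TermBank`, its methods, and the use of in-memory maps are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the term bank preprocessing module; all names are illustrative. */
final class TermBank {
    // Merged view used during one run of the device: Chinese term -> English term.
    private final Map<String, String> terms = new LinkedHashMap<>();

    /**
     * Loads the built-in terms first and the custom terms second, so that a
     * custom entry with the same Chinese key overrides the built-in entry.
     * The maps passed in are not modified, so the override is not permanent.
     */
    static TermBank load(Map<String, String> builtIn, Map<String, String> custom) {
        TermBank bank = new TermBank();
        bank.terms.putAll(builtIn); // step 1: built-in term bank
        bank.terms.putAll(custom);  // step 2: custom term bank wins on duplicate keys
        return bank;
    }

    /** Returns true when the Chinese term is known to either term bank. */
    boolean contains(String chinese) {
        return terms.containsKey(chinese);
    }

    /** Returns the English term, or null when the Chinese term is unknown. */
    String english(String chinese) {
        return terms.get(chinese);
    }
}
```

Because the merged map is rebuilt on every start, dropping an entry from the custom term bank restores the built-in mapping on the next run, which matches the statement that the override lasts only for the start-up cycle.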

In another technical scheme, the method by which the Chinese semantic analysis word segmentation module splits the sentence comprises the following steps:

Step one, splitting the sentence character by character, counting the number n of Chinese characters, forming an ordered set An = (A1, A2, ..., Am, ..., An) from the split characters in sentence order, and setting n as the number of iteration cycles. After the iteration counter has been set, semantic analysis starts. Each round of semantic analysis creates a set Bn of the minimum-unit semantic words hit in that round, and Bn records all hit split words of the round together with their mapped English terms.

Step two, semantic analysis: taking the first element A1 of the ordered set An and comparing it with the term bank; if it hits, recording the element in the set Bn and mapping the corresponding English term. The element is then cumulatively combined with the remaining elements Am of An, one at a time, to form new candidate words; each candidate is compared with the term bank, and the hit element and hit candidates together form the ordered set Bn. New candidates are created iteratively until A1 has been cumulatively combined with all of the remaining n-1 elements. After the iteration for element A1 is complete, the longest minimum-unit semantic word in Bn is read as the split term.

Step three, after the iteration for element A1 is complete, the longest minimum-unit semantic word in Bn is read as the first split term, whose length is m. The 2nd iteration starts from element A(m+1), and the longest minimum-unit semantic word in the 2nd-round set Bn is read as the second hit split term and mapped to the corresponding English term. The iterations repeat until m = n, that is, until the whole sentence has been consumed; the loop then ends, semantic splitting is complete, and the split vocabulary is formed.

As shown in fig. 3:

Suppose the sentence to be processed is "对公存款账户数" (corporate deposit account count). Splitting it character by character gives n = 7, and the characters with their positions are "对(1)", "公(2)", "存(3)", "款(4)", "账(5)", "户(6)" and "数(7)".

In the first iteration A1 = 对(1), and the cumulative combinations are "对", "对公", "对公存", "对公存款", "对公存款账", "对公存款账户" and "对公存款账户数", 7 new words in total. When these are matched against the term bank, only "对" and "对公" hit, so Bn = [对, 对公] and the longest semantic word is "对公", that is, m = 2. Iteration 2 starts from A3 = 存(3) and yields the longest semantic word "存款" (deposit); iteration 3 starts from A5 = 账(5) and yields "账户" (account); iteration 4 starts from A7 = 数(7) and yields "数" (count). After splitting, the final semantic word segments are "对公", "存款", "账户" and "数".
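The three splitting steps amount to a left-to-right longest match against the term bank, which could be sketched as below. This is a hedged illustration rather than the device's actual code: the class `TermSegmenter`, the method `split`, and the representation of the term bank as a `Set<String>` are hypothetical, and the handling of a position where nothing hits (the rest of the sentence is emitted as one piece) is only one possible reading of the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the splitting steps; names and miss handling are assumptions. */
final class TermSegmenter {
    /**
     * Splits a sentence into term-bank words, scanning from left to right.
     * Assumption: when no combination starting at a position is in the term
     * bank, the remainder of the sentence is emitted as one unmatched piece.
     */
    static List<String> split(String sentence, Set<String> termBank) {
        // Step one: the ordered set An, one element per character (code point).
        int[] a = sentence.codePoints().toArray();
        int n = a.length;
        List<String> result = new ArrayList<>();
        int start = 0; // 0-based index of the first element of the current round
        while (start < n) {
            // Step two: cumulatively combine a[start] with every following element
            // and remember the longest combination found in the term bank.
            StringBuilder candidate = new StringBuilder();
            int longest = 0; // length m of the longest hit, 0 when nothing hits
            for (int end = start; end < n; end++) {
                candidate.appendCodePoint(a[end]);
                if (termBank.contains(candidate.toString())) {
                    longest = end - start + 1;
                }
            }
            // Step three: emit the longest hit and restart at element A(m+1).
            int length = (longest > 0) ? longest : n - start;
            result.add(new String(a, start, length));
            start += length;
        }
        return result;
    }
}
```

With a term bank containing 对, 对公, 存款, 账户 and 数, calling `TermSegmenter.split("对公存款账户数", termBank)` returns the four pieces 对公, 存款, 账户 and 数, in that order, as in the FIG. 3 walk-through.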

After splitting, the computational cost approximates the sum of an arithmetic series, $S_n = \frac{n(a_1 + a_n)}{2}$. Assuming that each iteration takes a similar amount of time, the time complexity is determined by the number of iterations, that is, by the number of Chinese characters in the sentence: $O(n)$.
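As a rough illustration of the counting behind this estimate, the sketch below tallies how many cumulative combinations are compared with the term bank, assuming each combination costs exactly one comparison. The class `LookupCounter` and both methods are hypothetical helpers introduced only for this arithmetic. The outer loop runs once per split term, so never more than n times, which is the iteration count the text refers to; the total number of comparisons follows the arithmetic series when every term is a single character.

```java
/** Hypothetical helper that counts term-bank comparisons; not part of the patent text. */
final class LookupCounter {
    /**
     * @param n           number of characters in the sentence
     * @param termLengths length m of the term chosen in each iteration, in order
     * @return total number of cumulative combinations compared with the term bank
     */
    static int comparisons(int n, int[] termLengths) {
        int total = 0;
        int start = 0; // 0-based start position of the current iteration
        for (int m : termLengths) {
            total += n - start; // one comparison per cumulative combination
            start += m;         // the next iteration begins at element A(m+1)
        }
        return total;
    }

    /** Arithmetic-series sum S_n = n(a1 + an)/2 with first term a1 and last term an. */
    static int seriesSum(int n, int a1, int an) {
        return n * (a1 + an) / 2;
    }
}
```

For the seven-character example, `comparisons(7, new int[] {2, 2, 2, 1})` gives 7 + 5 + 3 + 1 = 16 comparisons over 4 iterations. If every term were a single character, the count would be 7 + 6 + ... + 1 = 28, which equals `seriesSum(7, 7, 1)`.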

After Chinese semantic analysis, the split vocabulary is formed and then processed by the Chinese-to-English conversion module of the device, which converts, for example, "对公" (corporate) into "Corp", "存款" (deposit) into "Dpst", "账户" (account) into "Acct" and "数" (count) into "Cnt".

In another technical scheme, if no candidate hits the term bank during the comparison in step two and step three, the longest word accumulated from the combinations starting at element A1 is taken as the split word, and the mapped English term is represented by a placeholder. That is, after the iteration for A1 is complete, the longest minimum-unit semantic word is normally read from Bn as the split term; if nothing hits the term bank during processing, the longest word accumulated from A1 is used as the split term instead, and its English mapping is represented by the "+" placeholder. Assuming that "数" does not exist in the term bank, the integrated Chinese is "对公_存款_账户_数+" and the English is "Corp_Dpst_Acct_+".
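The conversion and the "+" placeholder behaviour could be sketched as follows, assuming the sentence has already been split into words. The class `TermConverter`, its method names, and the use of a `Map<String, String>` as the term bank are hypothetical; only the "+" symbol and the sample abbreviations come from the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the Chinese-to-English conversion with the "+" placeholder. */
final class TermConverter {
    static final String PLACEHOLDER = "+";

    /** Maps each split word to its English term, or to "+" when the term bank has no entry. */
    static List<String> toEnglish(List<String> words, Map<String, String> termBank) {
        List<String> english = new ArrayList<>();
        for (String word : words) {
            english.add(termBank.getOrDefault(word, PLACEHOLDER));
        }
        return english;
    }

    /** Marks unmatched Chinese words by appending "+", leaving matched words unchanged. */
    static List<String> markMisses(List<String> words, Map<String, String> termBank) {
        List<String> marked = new ArrayList<>();
        for (String word : words) {
            marked.add(termBank.containsKey(word) ? word : word + PLACEHOLDER);
        }
        return marked;
    }
}
```

For the split words 对公, 存款, 账户 and 数, with a term bank that lacks 数, `toEnglish` yields Corp, Dpst, Acct and + while `markMisses` yields 对公, 存款, 账户 and 数+, which join with underscores into the two strings given in the example.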

In another technical scheme, the device further comprises a target semantic integration module for joining adjacent split words and/or phrases with a preset symbol and for joining the corresponding adjacent English terms with a preset symbol.

In the above technical solution, the processed Chinese word segments and English terms are handled by the target semantic integration module, which joins the words with the underscore "_". For example, the Chinese word segments are integrated into "对公_存款_账户_数" and the English terms into "Corp_Dpst_Acct_Cnt". To meet different requirements, the English terms are also provided in all upper case ("CORP_DPST_ACCT_CNT") and all lower case ("corp_dpst_acct_cnt"). The Chinese integration mainly serves to display the split vocabulary after output, which makes it easier to adjust the term bank and improve splitting accuracy; the English integration makes the terms conform to financial term specifications. If a term that does not exist in the term bank is encountered, the module applies the "+" placeholder mechanism: assuming that "数" does not exist, the integrated Chinese is "对公_存款_账户_数+" and the English is "Corp_Dpst_Acct_+".
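The joining and the three output forms could be sketched as below. The class `TermJoiner` and its method names are hypothetical, and the sketch assumes the term bank already stores each English abbreviation with an initial capital (for example "Corp"), so the camel-style form is simply the stored terms joined by the underscore.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the target semantic integration module. */
final class TermJoiner {
    static final String SEPARATOR = "_";

    /** Joins the terms exactly as stored (assumed to be initial-capital abbreviations). */
    static String camel(List<String> terms) {
        return String.join(SEPARATOR, terms);
    }

    /** All-upper-case form; Locale.ROOT keeps the result independent of the system locale. */
    static String upper(List<String> terms) {
        return camel(terms).toUpperCase(Locale.ROOT);
    }

    /** All-lower-case form. */
    static String lower(List<String> terms) {
        return camel(terms).toLowerCase(Locale.ROOT);
    }
}
```

For the terms Corp, Dpst, Acct and Cnt, the three methods return "Corp_Dpst_Acct_Cnt", "CORP_DPST_ACCT_CNT" and "corp_dpst_acct_cnt". The "+" placeholder has no case, so it passes through all three forms unchanged.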

In another technical scheme, the target semantic integration module presents the English terms corresponding to the sentence in camel case, all upper case and all lower case. The Chinese and English terms produced by the target semantic integration module are then handled by the result output module, which outputs a single sentence to the console (standard output) and batches of sentences as an Excel file, as in the Excel batch output shown in FIG. 5.

A method for converting financial field terms into English comprises the following steps:

step one, acquiring a sentence to be processed and splitting it, in sentence order, into a plurality of words and/or phrases according to the financial field terms it contains, wherein a term bank containing a plurality of financial field terms is preset;

step two, converting the split words and/or phrases into English terms;

step three, outputting the sentence to be processed and the corresponding English terms.

In this technical scheme, sentences are split according to financial field terms, so that financial field semantics are analyzed accurately and mapped to English terms, achieving accurate translation.

Compared with the prior art, the invention segments words more accurately and is suited to the financial industry, offers more output forms, and runs in a wider range of environments: it can be used wherever a JDK can be installed, and it avoids the licensing issues of Excel-based tools.

While embodiments of the invention have been described above, the invention is not limited to the details shown and described herein. It is to be accorded the widest scope consistent with the principles and novel features disclosed herein, and modifications readily available to those skilled in the art fall within that scope, provided they do not depart from the general concept defined by the appended claims and their equivalents.

Claims

1. An apparatus for converting financial field terms into English, comprising:

the Chinese semantic analysis word segmentation module is used for acquiring a sentence to be processed and splitting the sentence, in sentence order, into a plurality of words and/or phrases according to the financial field terms it contains;

and the result output module is used for outputting the sentence to be processed and the corresponding English terms.

2. The apparatus for converting financial field terms into English according to claim 1, wherein the term bank comprises a built-in term bank containing a plurality of financial field terms and a custom term bank for adding new terms, and the custom term bank has higher priority than the built-in term bank.

3. The apparatus for converting financial field terms into English according to claim 1, wherein if no candidate hits the term bank during the comparison in step two and step three, the longest word accumulated from the combinations starting at element A1 is taken as the split word, and the mapped English term is represented by a placeholder.

4. The apparatus for converting financial field terms into English according to claim 1, further comprising a target semantic integration module for joining adjacent split words and/or phrases with a preset symbol and for joining the corresponding adjacent English terms with a preset symbol.

5. The apparatus for converting financial field terms into English according to claim 4, wherein the target semantic integration module presents the English terms corresponding to the sentence in camel case, all upper case and all lower case.

6. The apparatus for converting financial field terms into English according to claim 1, wherein the result output module outputs the result either to the console or to an Excel file.

7. A method for converting financial field terms into English, comprising the following steps:

the method for splitting the sentence comprises the following steps:

step a, splitting the sentence character by character, counting the number n of Chinese characters, forming an ordered set An = (A1, A2, ..., Am, ..., An) from the split characters in sentence order, and setting n as the number of iteration cycles;

step c, reading the longest minimum-unit semantic word in the set Bn as the first split term, whose length is m; starting the 2nd iteration from element A(m+1), reading the longest minimum-unit semantic word from the 2nd-round set Bn as the second hit split term and mapping the corresponding English term; repeating until m = n, that is, until the whole sentence has been consumed, at which point semantic splitting is complete and the split vocabulary is formed;

step two, converting the split words and/or phrases into English terms;