CN103793395A - Mass multi-language resource rapidly searching and reusing method - Google Patents

Mass multi-language resource rapidly searching and reusing method

Info

Publication number
CN103793395A
Authority
CN
China
Prior art keywords
language
multilingual
assets
database
searching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210423241.8A
Other languages
Chinese (zh)
Inventor
杜金林
朱懿
杜勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Translated by Mdt InfoTech Ltd, Shanghai
Original Assignee
SHANGHAI YONGJINYI INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI YONGJINYI INFORMATION TECHNOLOGY Co Ltd
Priority to CN201210423241.8A
Publication of CN103793395A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for rapidly searching and reusing massive multilingual resources. A purpose-built corpus parsing module reads the content of corpora and term bases stored in XML-based (Extensible Markup Language) standard formats such as TMX (Translation Memory eXchange) and TBX (Term Base eXchange) and imports it into a specified database. During import, database tables that hold the same content in different language pairs are matched and merged automatically, producing a multilingual database in which each source text is matched with texts in multiple target languages. When a user search request is received, a filter screens the database results by number of word segments before they are returned, which improves search speed and precision. In use, the search results are fed back automatically in translation memory form for the language pair the user specifies, so that language resources are reused and gain added value, personnel in different regions can search massive multilingual resources rapidly, translation efficiency and accuracy improve, and the language resources are used to their fullest. Because the language resources remain stored directly in text-based database formats, data loss is avoided and resource security is improved.

Description

A method for rapidly querying and reusing massive multilingual assets
Technical field
The present invention relates to a method for rapidly querying and reusing massive multilingual assets. It is used in developing and applying the multilingual asset search and sharing modules of CAT software or multilingual translation systems, and belongs to the field of multilingual machine translation technology.
Background art
Translation memory (TM) is one of the most widely adopted technologies in computer-aided translation (CAT); it can significantly improve translation efficiency and ensure content consistency. Because many kinds of CAT software have been built on TM technology and their TM storage formats differ, an open standard called TMX (Translation Memory eXchange) has been successfully applied in the localization and translation industry to ease the exchange of TM data between translation vendors and CAT tools.
In software and website localization, the content files to be processed are highly repetitive. Content is also updated frequently, and each update builds on the previous version, adding a small amount of new content or making minor corrections to existing content. It is therefore necessary to make full use of content already translated in earlier versions rather than translate it again.
TM technology reuses this translated content effectively. It improves translation efficiency by means of segments and a TM database: the database uses the translation unit (TU) as its data unit and links each source-language sentence to the corresponding target-language sentence. When a translator works in a TM-based CAT tool, the tool continually stores newly translated content in the TM database. For content to be translated (a word, phrase, sentence, or paragraph), the tool first searches the TM database for matching content and automatically offers the closest translation, which the translator can insert easily.
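The lookup-and-store cycle described above can be illustrated with a minimal sketch. It assumes an in-memory dictionary as the TM database and uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure; the names `tm`, `best_match`, `store`, and the 0.75 threshold are hypothetical and are not taken from the patent or from any particular CAT tool.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory TM for one language pair: source sentence -> target sentence.
tm = {
    "Click the Save button.": "单击“保存”按钮。",
    "Click the Cancel button.": "单击“取消”按钮。",
}

def best_match(segment, memory, threshold=0.75):
    """Return (source, target, score) for the closest stored source, or None."""
    best = None
    for source, target in memory.items():
        # Similarity ratio in [0, 1]; an assumed stand-in for a CAT tool's match score.
        score = SequenceMatcher(None, segment, source).ratio()
        if best is None or score > best[2]:
            best = (source, target, score)
    if best is not None and best[2] >= threshold:
        return best
    return None

def store(segment, translation, memory):
    """Record a newly translated segment so that later lookups can reuse it."""
    memory[segment] = translation
```

A translator's tool would call `best_match` before translating a segment and `store` after the translator confirms a new translation.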
As translated content accumulates, the TM database grows. The translator no longer has to retranslate identical content and need only concentrate on new content, while the accuracy of the TM ensures that identical content is translated consistently. This is the goal that TM technology pursues.
However, as economic globalization deepens, the software and website localization and globalization industry has grown rapidly, and with it the number of localization tools and TM tools built on TM technology. These tools come from different vendors, each with its own file storage format. Moreover, a localization service provider often serves different clients, or different projects of the same client, and those clients and projects require different localization tools. Because the tools' file data lack an exchangeable standard format, TM resources accumulated in the past are difficult to reuse.
Clearly, TM database formats urgently need to be unified. Formulating a standard for exchanging translation data has become a pressing task for the localization/globalization industry: it lets service providers, clients, and tool developers within the industry handle information uniformly and benefits all parties. Growing market demand and advances in TM technology together brought about the TMX standard.
  
The goal of TMX is to ease the exchange of translation memory data between tools and/or translation vendors and to reduce or avoid the loss of important data during exchange. While preserving the translation data content, TMX defines a neutral data exchange standard for different localization and translation tools, and a growing number of localization and translation tools on the market now support it.
According to an industry survey by OSCAR (Open Standards for Container/Content Allowing Re-use), TM resources have become a steadily growing strategic asset for localization/globalization service providers, worth upwards of one million US dollars, and to some extent play an important role in international business worth hundreds of millions.
Localization/globalization service providers usually store their TM resources in TMX and their term resources in TBX. This preserves the value of these corporate assets: they are not tied to any particular CAT tool and are not lost as the market and technology change.
In summary, as economic globalization deepens and the software and website localization and globalization industry grows rapidly, reusing language assets (TM and term resources) already stored in TMX and TBX format helps raise output and quality and reduce costs. A TMX or TBX file usually holds a single language pair, such as English to Chinese or English to German. The industry's technology, however, still supports only single language pairs; no existing technology automatically generates multilingual language pairs from identical content in existing single language pairs, which would maximize the use of TMX and TBX resources.
From a business standpoint, each vendor wants users to depend more heavily on its own CAT product. From the user's standpoint, however, a method that supports open standards for massive language assets and automatically generates multilingual language pairs from identical content in single language pairs, while keeping the assets secure and maximizing their use, would be quite valuable.
Shortcomings of the prior art: 1) The existing language asset storage architecture is two-dimensional and one-directional. The correspondence between the source language and each target language cannot be linked up, so a source text in one source language yields a translation in only one target language. 2) Multiple TMX/TBX files are imported into separate translation memories and term bases that are independent of one another, so later updates to TUs and terms fall out of sync.
Summary of the invention
To solve the above problems, the present invention provides a method for rapidly querying and reusing massive multilingual assets. The technical solution is as follows:
A method for rapidly querying and reusing massive multilingual assets, comprising the following steps:
1. A purpose-built corpus parsing module reads the content of corpora and term bases in XML-based standard formats such as TMX and TBX and imports it into a specified database;
2. During import, database tables holding the same content in different language pairs are automatically matched and merged, generating a multilingual database in which one source text is matched with multiple target languages;
3. During import, each language's text is automatically segmented into words according to the rules of that language, and the result is recorded in the database;
4. After a user search request is received and before the database returns the search results, a filter screens them by number of word segments; this effectively reduces useless matches, making queries faster and results more accurate;
5. In use, the search results are automatically fed back to the user in translation memory form for the language pair the user specifies, and are presented to the end user in a specific format for reuse;
6. When the multilingual database is added to or updated, the related multilingual content is updated automatically, so that after the language assets change dynamically the user continues to obtain up-to-date translation memory content.
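Steps 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: it parses TMX with Python's standard `xml.etree.ElementTree`, keeps the merged result in a plain dictionary rather than a database, and treats two translation units as "the same content" when their source text is identical. The function names `read_tmx` and `merge` and the sample documents are hypothetical, and the sample TMX headers omit attributes that a complete TMX file would carry.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(tmx_text):
    """Yield one {language: text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        yield unit

def merge(units_by_file, source_lang):
    """Merge units that share the same source text into one multilingual record."""
    merged = defaultdict(dict)
    for units in units_by_file:
        for unit in units:
            source = unit.get(source_lang)
            if source is None:
                continue
            for lang, text in unit.items():
                if lang != source_lang:
                    merged[source][lang] = text
    return dict(merged)

# Two single-language-pair files that contain the same English source sentence.
EN_ZH = """<tmx version="1.4"><header srclang="en"/><body>
<tu><tuv xml:lang="en"><seg>Open file</seg></tuv><tuv xml:lang="zh"><seg>打开文件</seg></tuv></tu>
</body></tmx>"""
EN_DE = """<tmx version="1.4"><header srclang="en"/><body>
<tu><tuv xml:lang="en"><seg>Open file</seg></tuv><tuv xml:lang="de"><seg>Datei öffnen</seg></tuv></tu>
</body></tmx>"""

multilingual = merge([read_tmx(EN_ZH), read_tmx(EN_DE)], source_lang="en")
```

After the merge, the single English source sentence carries both its Chinese and its German translation, which is the one-source, many-targets structure that step 2 describes.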
The above method for rapidly querying and reusing massive multilingual assets preferably further comprises:
Corpus parsing module: parses the industry-standard formats TMX and TBX, reads the corpus information (including source language, target language, and so on) into memory, and converts it into binary objects.
Client: sends query instructions over the network, comprising the source text to query and the target language or languages to query.
Filter: filters large returned result sets in batches.
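The client's query instruction carries a source text and one or more target languages. The patent does not specify a wire format, so the following is only a hypothetical sketch that serializes such a request as JSON; the class name `QueryRequest` and its field names are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class QueryRequest:
    """Hypothetical query message; the field names are illustrative assumptions."""
    source_text: str
    source_lang: str
    target_langs: list = field(default_factory=list)

    def to_json(self):
        # ensure_ascii=False keeps non-Latin source text readable in the payload.
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, payload):
        return cls(**json.loads(payload))

request = QueryRequest("Open file", "en", ["zh", "de"])
payload = request.to_json()
```

The server side of such a design would decode the payload with `QueryRequest.from_json` before running the search.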
Filter principle:
The filter examines the word-segment counts previously stored in the database, adds or subtracts the tolerable number of word segments, and filters out sentence pairs or terms/phrases that fall outside this range.
Suppose the word-segment count of a sentence pair is m and the tolerable number of word segments is k.
Permitted range: n = m ± k, that is, m − k ≤ n ≤ m + k
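The range check can be written out in a few lines. The sketch below assumes that m is the word-segment count of the text being queried and that each stored candidate carries its own count n; the patent text leaves this reading open, and the function names and sample data are hypothetical.

```python
def within_tolerance(n, m, k):
    """True when a stored segment count n lies in the permitted range m-k .. m+k."""
    return m - k <= n <= m + k

def filter_candidates(candidates, m, k):
    """Keep (text, segment_count) candidates whose count is within the range."""
    return [c for c in candidates if within_tolerance(c[1], m, k)]

# Hypothetical candidates with precomputed word-segment counts.
candidates = [
    ("Open", 1),
    ("Open the file", 3),
    ("Open the selected file now", 5),
    ("Please open the file that you selected earlier", 8),
]

# With m = 4 and k = 1, only counts 3, 4, and 5 pass.
kept = filter_candidates(candidates, m=4, k=1)
```

With m = 4 and k = 1 the permitted range is 3 to 5, so the one-segment and eight-segment candidates are dropped before any more expensive matching takes place.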
The beneficial effects of the method are as follows. The language assets held in the multilingual database are physically independent of the language assets held in TMX and TBX form; even if the multilingual database is deleted, the original language assets are unaffected, which keeps the assets secure. Moreover, the assets are stored on the storage medium as text-based XML (both TMX and TBX are XML-based), unlike the binary database files that CAT tools frequently read and write, so they are protected against accidental loss.
Processing the two industry-standard formats TMX and TBX directly brings the following benefits:
1) Language assets saved in text-based database formats are reused directly; the data is not easily damaged or lost, which improves asset security.
2) No manual format conversion is needed; industry-standard formats are imported automatically, so the language assets are reused.
3) Massive multilingual assets can be queried rapidly.
Description of the drawings
Fig. 1 is a system block diagram of the method for rapidly querying and reusing massive multilingual assets.
Specific embodiments
Abbreviations and key term definitions:
TM: Translation Memory
TU: Translation Unit
TMX: Translation Memory eXchange (translation memory exchange format)
TBX: Term Base eXchange (term base exchange format)
CAT: Computer Aided Translation
LISA: Localization Industry Standards Association
OSCAR: Open Standards for Container/Content Allowing Re-use
The specific embodiment is as follows:
A method for rapidly querying and reusing massive multilingual assets, comprising the following steps:
1) A purpose-built corpus parsing module reads the content of corpora and term bases in XML-based standard formats such as TMX and TBX and imports it into a specified database;
2) During import, database tables holding the same content in different language pairs are automatically matched and merged, generating a multilingual database in which one source text is matched with multiple target languages;
3) During import, each language's text is automatically segmented into words according to the rules of that language, and the result is recorded in the database;
4) After a user search request is received and before the database returns the search results, a filter screens them by number of word segments; this effectively reduces useless matches, making queries faster and results more accurate;
5) In use, the search results are automatically fed back to the user in translation memory form for the language pair the user specifies, and are presented to the end user in a specific format for reuse;
6) When the multilingual database is added to or updated, the related multilingual content is updated automatically, so that after the language assets change dynamically the user continues to obtain up-to-date translation memory content.
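The storage side of the steps above (recording a word-segment count at import, and updating multilingual content in place) might look like the following sketch. It uses an in-memory SQLite database; the table layout, the `segment` and `upsert` helpers, and the per-character treatment of Chinese are all assumptions made for illustration. In particular, splitting Chinese text into single characters is only a crude stand-in for real word segmentation.

```python
import re
import sqlite3

def segment(text, lang):
    """Rough segmentation: one segment per character for Chinese, per word otherwise."""
    if lang.startswith("zh"):
        return [ch for ch in text if not ch.isspace()]
    return re.findall(r"\w+", text)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE unit (id INTEGER PRIMARY KEY, lang TEXT, text TEXT, seg_count INTEGER,
                   UNIQUE (lang, text));
CREATE TABLE translation (unit_id INTEGER, lang TEXT, text TEXT,
                          PRIMARY KEY (unit_id, lang));
""")

def upsert(conn, src_lang, src_text, tgt_lang, tgt_text):
    """Insert a pair, or overwrite the stored translation for that target language."""
    # The source unit is created once; its segment count is stored alongside it.
    conn.execute("INSERT OR IGNORE INTO unit (lang, text, seg_count) VALUES (?, ?, ?)",
                 (src_lang, src_text, len(segment(src_text, src_lang))))
    (unit_id,) = conn.execute("SELECT id FROM unit WHERE lang = ? AND text = ?",
                              (src_lang, src_text)).fetchone()
    # One row per (unit, target language); a later import replaces the older text.
    conn.execute("INSERT OR REPLACE INTO translation (unit_id, lang, text) VALUES (?, ?, ?)",
                 (unit_id, tgt_lang, tgt_text))

upsert(conn, "en", "Open the file", "zh", "打开文件")
upsert(conn, "en", "Open the file", "de", "Datei öffnen")
upsert(conn, "en", "Open the file", "de", "Die Datei öffnen")  # update replaces the German entry

rows = conn.execute("SELECT lang, text FROM translation ORDER BY lang").fetchall()
seg_count = conn.execute("SELECT seg_count FROM unit").fetchone()[0]
```

Keying translations on the source unit and target language keeps one source text linked to many target languages, and a repeated import for the same language overwrites the earlier entry rather than creating a second, diverging copy.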
The method for rapidly querying and reusing massive multilingual assets specifically further comprises:
Corpus parsing module: parses the industry-standard formats TMX and TBX, reads the corpus information (including source language, target language, and so on) into memory, and converts it into binary objects.
Client: sends query instructions over the network, comprising the source text to query and the target language or languages to query.
Filter: filters large returned result sets in batches.
Filter principle:
The filter examines the word-segment counts previously stored in the database, adds or subtracts the tolerable number of word segments, and filters out sentence pairs or terms/phrases that fall outside this range.
Suppose the word-segment count of a sentence pair is m and the tolerable number of word segments is k.
Permitted range: n = m ± k, that is, m − k ≤ n ≤ m + k
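One way to combine the segment-count range with the translation-memory style output is to apply the range inside the database query and then render the surviving rows as TMX. The sketch below does this with SQLite and `xml.etree.ElementTree`; the `pair` table, the `search` and `to_tmx` helpers, the substring match on the source text, and the sample data are all hypothetical, and the generated header is reduced to a single attribute rather than the full set a complete TMX file would carry.

```python
import sqlite3
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pair (src TEXT, tgt_lang TEXT, tgt TEXT, seg_count INTEGER)")
conn.executemany("INSERT INTO pair VALUES (?, ?, ?, ?)", [
    ("Open", "zh", "打开", 1),
    ("Open the file", "zh", "打开文件", 3),
    ("Open the file", "de", "Datei öffnen", 3),
    ("Open the file that you selected earlier", "zh", "打开您之前选择的文件", 7),
])

def search(conn, keyword, tgt_lang, m, k):
    """Return pairs whose stored segment count n satisfies m-k <= n <= m+k."""
    return conn.execute(
        "SELECT src, tgt FROM pair WHERE tgt_lang = ? AND src LIKE ? "
        "AND seg_count BETWEEN ? AND ? ORDER BY seg_count",
        (tgt_lang, f"%{keyword}%", m - k, m + k)).fetchall()

def to_tmx(rows, src_lang, tgt_lang):
    """Render result rows as a minimal TMX document string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", srclang=src_lang)
    body = ET.SubElement(tmx, "body")
    for src, tgt in rows:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

rows = search(conn, "Open", "zh", m=3, k=1)
document = to_tmx(rows, "en", "zh")
```

Here the range 2 to 4 excludes both the one-segment and the seven-segment entries, so only one English to Chinese pair reaches the TMX output for the requested language pair.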
The language assets held in the multilingual database are physically independent of the language assets held in TMX and TBX form; even if the multilingual database is deleted, the original language assets are unaffected, which keeps the assets secure. Moreover, the assets are stored on the storage medium as text-based XML (both TMX and TBX are XML-based), unlike the binary database files that CAT tools frequently read and write, so they are protected against accidental loss.
Processing the two industry-standard formats TMX and TBX directly brings the following benefits:
1) Language assets saved in text-based database formats are reused directly; the data is not easily damaged or lost, which improves asset security.
2) No manual format conversion is needed; industry-standard formats are imported automatically, so the language assets are reused.
3) Massive multilingual assets can be queried rapidly.
Each vendor wants users to depend more heavily on its own CAT product. From the user's standpoint, however, a method for rapidly querying and reusing massive multilingual assets that keeps the assets secure while maximizing the use of language resources would be quite valuable. The technical solution of the present invention achieves the following: language assets saved in text-based database formats are reused directly, so the data is not easily damaged or lost and asset security improves; no manual format conversion is needed, since industry-standard formats are imported automatically, so the language assets are reused and gain additional value; and personnel in different regions can rapidly query massive multilingual assets, which improves translation efficiency and accuracy and puts the language assets to their fullest use.
The above is only a preferred embodiment of the present invention; any non-inventive improvement made by those skilled in the art within this spirit falls within the scope of protection of the present invention.
  

Claims (2)

1. A method for rapidly querying and reusing massive multilingual assets, characterized by: 1) directly parsing industry-standard formats to generate multilingual, multi-dimensional language pairs and term pairs; 2) rapidly querying and reusing multilingual translation memories and term bases by means of filtering.
2. The method for rapidly querying and reusing massive multilingual assets according to claim 1, characterized by: 1) a corpus parsing module that parses the industry-standard formats TMX and TBX, reads the corpus information (including source language, target language, and so on) into memory, and converts it into binary objects; 2) a client that sends query instructions over the network, comprising the source text to query and the target language or languages to query; 3) a filter that filters large returned result sets in batches; 4) the filter principle: the filter examines the word-segment counts previously stored in the database, adds or subtracts the tolerable number of word segments, and filters out sentence pairs or terms/phrases outside this range; 5) where the word-segment count of a sentence pair is m and the tolerable number of word segments is k, the permitted range is n = m ± k.
CN201210423241.8A 2012-10-30 2012-10-30 Mass multi-language resource rapidly searching and reusing method Pending CN103793395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210423241.8A CN103793395A (en) 2012-10-30 2012-10-30 Mass multi-language resource rapidly searching and reusing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210423241.8A CN103793395A (en) 2012-10-30 2012-10-30 Mass multi-language resource rapidly searching and reusing method

Publications (1)

Publication Number Publication Date
CN103793395A true CN103793395A (en) 2014-05-14

Family

ID=50669078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210423241.8A Pending CN103793395A (en) 2012-10-30 2012-10-30 Mass multi-language resource rapidly searching and reusing method

Country Status (1)

Country Link
CN (1) CN103793395A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106412080A (en) * 2016-10-21 2017-02-15 李丽亚 Method for realizing software localization and internationalization based on network service
CN107391499A (en) * 2017-08-03 2017-11-24 深圳Tcl新技术有限公司 It is automatically imported interpretation method, text importing terminal and computer-readable recording medium
CN109815390A (en) * 2018-11-08 2019-05-28 平安科技(深圳)有限公司 Search method, device, computer equipment and the computer storage medium of multilingual information
CN111930888A (en) * 2020-07-17 2020-11-13 上海泛微网络科技股份有限公司 Multi-language support method based on collaborative idea, server system and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106412080A (en) * 2016-10-21 2017-02-15 李丽亚 Method for realizing software localization and internationalization based on network service
CN107391499A (en) * 2017-08-03 2017-11-24 深圳Tcl新技术有限公司 It is automatically imported interpretation method, text importing terminal and computer-readable recording medium
CN109815390A (en) * 2018-11-08 2019-05-28 平安科技(深圳)有限公司 Search method, device, computer equipment and the computer storage medium of multilingual information
CN109815390B (en) * 2018-11-08 2023-08-08 平安科技(深圳)有限公司 Method, device, computer equipment and computer storage medium for retrieving multilingual information
CN111930888A (en) * 2020-07-17 2020-11-13 上海泛微网络科技股份有限公司 Multi-language support method based on collaborative idea, server system and storage medium


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHANGHAI YOUYI INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: SHANGHAI YONGJINYI INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20141106

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20141106

Address after: 306, room 200439, Gao Jing International Building, 101 Yin Gao Xi Road, Shanghai, Baoshan District

Applicant after: Translated by Mdt InfoTech Ltd, Shanghai

Address before: 306, room 200439, Gao Jing International Building, 101 Yin Gao Xi Road, Shanghai, Baoshan District

Applicant before: SHANGHAI YONGJINYI INFORMATION TECHNOLOGY CO., LTD.

ASS Succession or assignment of patent right

Owner name: DU JINLIN

Free format text: FORMER OWNER: SHANGHAI YOUYI INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20150326

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 200439 BAOSHAN, SHANGHAI TO: 200441 BAOSHAN, SHANGHAI

TA01 Transfer of patent application right

Effective date of registration: 20150326

Address after: 200441 Shanghai Yixian Road, No. 2816 Wordsworth Pentium building B building 20 floor

Applicant after: Du Jinlin

Address before: 306, room 200439, Gao Jing International Building, 101 Yin Gao Xi Road, Shanghai, Baoshan District

Applicant before: Translated by Mdt InfoTech Ltd, Shanghai

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160205

Address after: 200441 Shanghai Yixian Road, No. 2816 Wordsworth Pentium building B building 20 floor

Applicant after: Translated by Mdt InfoTech Ltd, Shanghai

Address before: 200441 Shanghai Yixian Road, No. 2816 Wordsworth Pentium building B building 20 floor

Applicant before: Du Jinlin

RJ01 Rejection of invention patent application after publication

Application publication date: 20140514

RJ01 Rejection of invention patent application after publication