WO2021098490A1 - Language recognition method and apparatus, terminal device, and computer-readable storage medium - Google Patents

Info

Publication number
WO2021098490A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
text
recognized
sample
recognition model
Application number
PCT/CN2020/125591
Other languages
English (en)
French (fr)
Inventor
蒲勇飞
罗俊颜
朱丽飞
王志远
施烈航
黄健超
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021098490A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • This application belongs to the field of artificial intelligence (AI) and computer vision technology, and in particular relates to a language recognition method, device, terminal device, and computer-readable storage medium.
  • in one existing approach, the text line image is sampled by a sliding-window method to obtain multiple image blocks, and the image blocks are input into a convolutional neural network for recognition, which yields a language for each image block.
  • the language predicted for the largest number of image blocks is then used as the language corresponding to the text line image.
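As a rough illustration of that prior-art pipeline, the sliding-window sampling step can be sketched as follows; the window width, stride, and the 1-D pixel representation are simplifications for illustration, not details taken from the application:

```python
def sliding_window_blocks(line_image, window, stride):
    """Sample a text-line image (a 1-D list standing in for pixel
    columns) into overlapping image blocks; each block would then be
    fed to a convolutional neural network for per-block recognition."""
    blocks = []
    for left in range(0, len(line_image) - window + 1, stride):
        blocks.append(line_image[left:left + window])
    return blocks
```

Each block receives a per-block language prediction, and the most frequent prediction is taken as the language of the whole line.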
  • the embodiments of the present application provide a language recognition method, device, terminal device, and computer-readable storage medium, which can improve the accuracy of language recognition in text line images.
  • an embodiment of the present application provides a language recognition method, including:
  • the language recognition model includes a language classification network and a convolutional network
  • the inputting the image of the text line to be recognized into the trained language recognition model to obtain the language of the text to be recognized includes:
  • if the language is a language family that includes multiple languages, the feature information is input into the convolutional network to determine the linguistic pattern of the text to be recognized, and the language in the family that matches the pattern is selected as the language of the text to be recognized.
  • before the inputting of the image of the text line to be recognized into the trained language recognition model, the method further includes:
  • the language of the sample text is a language family
  • the initial language recognition model obtained when the preset conditions are met is used as the language recognition model.
  • before the input of the sample text line images in the sample set into the initial language classification network of the initial language recognition model, the method further includes:
  • the historical sample set including the sample text line image and a text identifier corresponding to each of the sample text line images
  • converting each of the text identifiers into language codes according to a preset code table to obtain a sample set consisting of the sample text line images and the language codes corresponding to each sample text line image, where the code table includes multiple languages and each character in each language corresponds to at least one language code.
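A minimal sketch of that conversion step, assuming a toy code table; the application does not publish its actual table or code values, so the entries below are hypothetical:

```python
# Hypothetical code table: each character of each language maps to at
# least one language code. A real table would cover full alphabets.
CODE_TABLE = {
    "Chinese": {"瓷": [101], "器": [102]},
    "Latin": {"c": [201], "h": [202], "i": [203], "n": [204], "a": [205]},
}

def encode_text_identifier(text):
    """Convert a text identifier into its sequence of language codes
    according to the preset code table; characters missing from the
    table are simply skipped in this sketch."""
    codes = []
    for ch in text:
        for chars in CODE_TABLE.values():
            if ch in chars:
                codes.extend(chars[ch])
                break
    return codes
```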
  • the calculating of the first loss value between the language of the sample text and the actual language of the sample text according to the preset first loss function includes: calculating the first loss value between the language of the sample text and the actual language of the sample text according to the continuous time series classification (CTC) loss function.
  • the calculation of the second loss value between the language type of the sample text and the actual language type of the sample text according to the preset second loss function includes:
  • a second loss value between the language type of the sample text and the actual language type of the sample text is calculated according to the normalized exponential (softmax) loss function.
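The normalized exponential loss named above is commonly known as the softmax cross-entropy loss; a pure-Python sketch follows, where the per-language scores are hypothetical inputs rather than the model's actual outputs:

```python
import math

def softmax(scores):
    """Normalized exponential: turn raw per-language scores into a
    probability distribution over the candidate languages."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_exponential_loss(scores, actual_index):
    """Second loss value: negative log-probability assigned to the
    actual language of the sample text."""
    return -math.log(softmax(scores)[actual_index])
```

The first loss value would instead be computed with a CTC-style sequence loss, which is omitted here for brevity.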
  • the language classification network of the language recognition model is used to recognize the language of each character in the text to be recognized, and the language with the largest count is used as the language of the text to be recognized.
  • an embodiment of the present application provides a language recognition device, including:
  • An image acquisition module for acquiring an image of a text line to be recognized, where the text line image to be recognized includes the text to be recognized;
  • the recognition module is used to input the image of the text line to be recognized into a trained language recognition model to obtain the language of the text to be recognized; the language recognition model is used to determine the language of the text to be recognized according to the text line image and, if that language is a language family that includes multiple languages, to take the language in the family that matches the linguistic pattern of the text to be recognized as the language of the text to be recognized.
  • the language recognition model includes a language classification network and a convolutional network
  • the recognition module is further configured to input the line image of the text to be recognized into the language classification network to obtain feature information of the text to be recognized, and the feature information is used to indicate the language of the text to be recognized;
  • if the language is a language family that includes multiple languages, the feature information is input into the convolutional network, the linguistic pattern of the text to be recognized is determined, and the language that matches the pattern is selected from the family as the language of the text to be recognized.
  • the device further includes:
  • the first training module is used to input the sample text line image in the sample set into the initial language classification network of the initial language recognition model to obtain sample feature information of the sample text in the sample text line image, and the sample feature information is used to indicate The language of the sample text;
  • the first calculation module is configured to calculate a first loss value between the language of the sample text and the actual language of the sample text according to a preset first loss function if the language of the sample text is not a language family;
  • the second training module is used to input the sample feature information into the initial convolutional network of the initial language recognition model if the language of the sample text is a language family, and to select from the family the language that matches the linguistic pattern of the sample text as the language of the sample text;
  • a second calculation module configured to calculate a second loss value between the language type of the sample text and the actual language type of the sample text according to a preset second loss function
  • the adjustment module is used to adjust the model parameters of the initial language recognition model when the first loss value or the second loss value does not meet a preset condition, and to return to the step of inputting the sample text line images in the sample set into the initial language classification network;
  • the determining module is configured to stop training the initial language recognition model when the first loss value and the second loss value both meet the preset condition, and to use the initial language recognition model at the point where both loss values satisfy the preset condition as the language recognition model.
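The training procedure implied by the adjustment and determining modules (iterate until both loss values satisfy the preset condition) can be sketched as follows; the step function and threshold are toy stand-ins, not the application's actual network or stopping criterion:

```python
def train_until_converged(params, step_fn, threshold, max_iters=1000):
    """Repeat the training step until BOTH loss values meet the preset
    condition (here: fall below `threshold`); the model parameters at
    that point are kept as the trained language recognition model."""
    for _ in range(max_iters):
        params, loss1, loss2 = step_fn(params)
        if loss1 < threshold and loss2 < threshold:
            return params, loss1, loss2
    raise RuntimeError("losses did not meet the preset condition")

def toy_step(p):
    # Toy update: each step halves the parameter, and both losses are
    # taken to equal it (no real network is trained here).
    p = p * 0.5
    return p, p, p

model, first_loss, second_loss = train_until_converged(8.0, toy_step, 0.1)
```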
  • the apparatus further includes:
  • a sample acquisition module configured to acquire a historical sample set, the historical sample set including a sample text line image and a text identifier corresponding to each of the sample text line images;
  • the sample generation module is used to convert each of the text identifiers into language codes according to a preset code table to obtain a sample set consisting of the sample text line images and the language code corresponding to each sample text line image, where the code table includes multiple languages and each character in each language corresponds to at least one language code.
  • the first calculation module is further configured to calculate the first loss value between the language of the sample text and the actual language of the sample text according to the continuous time series classification (CTC) loss function;
  • the second calculation module is further configured to calculate a second loss value between the language type of the sample text and the actual language type of the sample text according to a normalized exponential loss function.
  • the language classification network of the language recognition model is used to recognize the language of each character in the text to be recognized, and the language with the largest count is used as the language of the text to be recognized.
  • an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the language recognition method described in any one of the above first aspects is realized.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the language recognition method described in any one of the above first aspects is realized.
  • the embodiments of the present application provide a computer program product, which when the computer program product runs on a terminal device, causes the terminal device to execute the language recognition method described in any one of the above-mentioned first aspects.
  • the image of the text line to be recognized, which includes the text to be recognized, is obtained and input into the trained language recognition model, and the language of the text to be recognized is determined by the model. If the language is a language family that includes multiple languages, the model can take the language in the family that matches the linguistic pattern of the text to be recognized as the language of the text. Recognizing the language based on linguistic patterns avoids inaccurate recognition caused by identical or similar characters shared across languages, and improves the accuracy of language recognition.
  • Figure 1 is a schematic diagram of a scene involved in the language recognition method provided by this application.
  • FIG. 2 is a block diagram of a part of the structure of a mobile phone provided by an embodiment of the present application
  • Fig. 3 is a schematic flowchart of a language recognition method provided by the present application.
  • Fig. 4 is a schematic flowchart of another language recognition method provided by the present application.
  • Fig. 5 is a schematic flowchart of a method for training a language recognition model provided by the present application
  • Fig. 6 is a structural block diagram of a language recognition device provided by an embodiment of the present application.
  • FIG. 7 is a structural block diagram of another language recognition device provided by an embodiment of the present application.
  • Fig. 8 is a structural block diagram of another language recognition device provided by an embodiment of the present application.
  • the language recognition method provided in the embodiments of this application can be applied to mobile phones, tablet computers, wearable devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), and other terminal devices.
  • the embodiments of this application do not impose any restrictions on the specific types of terminal devices.
  • the terminal device may be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication functions, a computing device or other processing device connected to a wireless modem, a computer, a laptop, a handheld communication device, a handheld computing device, satellite wireless equipment, a wireless modem card, customer premise equipment (CPE), and/or other equipment used to communicate on wireless systems and next-generation communication systems, such as a mobile terminal in a 5G network or in a future evolved Public Land Mobile Network (PLMN), etc.
  • a wearable device is also a general term for devices developed by applying wearable technology to the intelligent design of everyday wear, such as glasses, gloves, watches, clothing, and shoes.
  • a wearable device is a portable device that is directly worn on the body or integrated into the user's clothes or accessories. A wearable device is not only a piece of hardware; it also realizes powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include full-featured, large-sized devices that can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses.
  • FIG. 1 is a schematic diagram of a scene involved in the language recognition method provided by the present application. As shown in FIG. 1, the scene includes: a terminal device 110 and an object 120 to be photographed.
  • the object to be photographed 120 may include a text to be recognized, and the terminal device 110 may photograph the object to be photographed 120 to obtain a line image of the text to be recognized including the text to be recognized.
  • the terminal device 110 can run the pre-trained language recognition model during the operation process, and the terminal device 110 can recognize the text to be recognized in the image to be recognized through the language recognition model, thereby determining the language corresponding to the text to be recognized.
  • the image to be recognized can be not only an image that includes the text to be recognized taken by the terminal device 110, but also an image that includes the text to be recognized pre-stored by the terminal device 110, or an image acquired through wireless transmission. There is no limitation on the way to obtain the image to be recognized.
  • the terminal device 110 may obtain the image to be recognized, and input the image to be recognized into a pre-trained language recognition model, and use the language recognition model to recognize each character of the text to be recognized, so that it can be based on multiple The language corresponding to the character is determined, and the language corresponding to the text to be recognized is determined.
  • the language recognition model can include a language classification network and a convolutional network. The language classification network is used to identify the language corresponding to the text to be recognized; when that language is a language family that includes multiple languages, the classification network cannot determine which language in the family the text belongs to, so the convolutional network learns the linguistic pattern of the text to be recognized and determines the corresponding language according to that pattern.
  • the terminal device 110 in the embodiment of the present application may be a terminal device 110 in the field of terminal artificial intelligence, which is applied to the field of computer technology.
  • the terminal device 110 may recognize the text in the scene, determine the language of the text and the content corresponding to the text. For example, the terminal device 110 may recognize the English sentence in the scene, determine that the text belongs to English, and translate the text to obtain the Chinese sentence corresponding to the English sentence.
  • Fig. 2 is a block diagram of a part of the structure of a mobile phone provided by an embodiment of the present application.
  • the mobile phone includes: a radio frequency (RF) circuit 210, a memory 220, an input unit 230, a display unit 240, a sensor 250, an audio circuit 260, a wireless fidelity (WiFi) module 270, and a processor 280 , And power supply 290 and other components.
  • the structure of the mobile phone shown in FIG. 2 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, or combine some of the components.
  • the RF circuit 210 can be used for receiving and sending signals in the process of sending and receiving information or during a call. In particular, after downlink information of the base station is received, it is handed to the processor 280 for processing; in addition, uplink data is sent to the base station.
  • the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 210 can also communicate with the network and other devices through wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.
  • the memory 220 may be used to store software programs and modules.
  • the processor 280 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 220.
  • the memory 220 may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function, an image playback function, etc.); the data storage area may store data created through use of the mobile phone (such as audio data, a phone book, etc.).
  • the memory 220 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the input unit 230 may be used to receive inputted digital or character information, and generate key signal input related to user settings and function control of the mobile phone.
  • the input unit 230 may include a touch panel 231 and other input devices 232.
  • the touch panel 231, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 231 with a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 231 may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 280, and can receive and execute commands sent by the processor 280.
  • the touch panel 231 can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 230 may also include other input devices 232.
  • other input devices 232 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, joystick, and the like.
  • the display unit 240 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
  • the display unit 240 may include a display panel 241.
  • the display panel 241 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), etc.
  • the touch panel 231 can cover the display panel 241. When the touch panel 231 detects a touch operation on or near it, it transmits the operation to the processor 280 to determine the type of the touch event, and the processor 280 then provides a corresponding visual output on the display panel 241 according to the type of the touch event.
  • although the touch panel 231 and the display panel 241 are used as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 231 and the display panel 241 may be integrated to realize the input and output functions of the mobile phone.
  • the processor 280 is the control center of the mobile phone. It uses various interfaces and lines to connect the various parts of the entire mobile phone, and, by running or executing the software programs and/or modules stored in the memory 220 and calling the data stored in the memory 220, it executes the various functions of the mobile phone and processes data, thereby monitoring the mobile phone as a whole.
  • the processor 280 may include one or more processing units; preferably, the processor 280 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, etc., and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may also not be integrated into the processor 280.
  • the mobile phone also includes a power source 290 (such as a battery) for supplying power to various components.
  • the power source can be logically connected to the processor 280 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
  • the mobile phone may also include a camera.
  • the position of the camera on the mobile phone may be front or rear, which is not limited in the embodiment of the present application.
  • the mobile phone may include a single camera, a dual camera, or a triple camera, etc., which is not limited in the embodiment of the present application.
  • a mobile phone may include three cameras, of which one is a main camera, one is a wide-angle camera, and one is a telephoto camera.
  • the multiple cameras may be all front-mounted, or all rear-mounted, or partly front-mounted and another part rear-mounted, which is not limited in the embodiment of the present application.
  • the mobile phone may also include a Bluetooth module, etc., which will not be repeated here.
  • FIG. 3 is a schematic flowchart of a language recognition method provided by the present application. As an example and not a limitation, the method may be applied to the above-mentioned terminal device 110. As shown in FIG. 3, the method may include:
  • S301: Acquire an image of the text line to be recognized, where the image includes the text to be recognized. The terminal device can obtain an image including the text to be recognized to get the text line image to be recognized, and detect the text to be recognized in it, so that the language of the text to be recognized can be determined according to the language of each of its characters.
  • the terminal device may photograph the text to be recognized according to a preset photographing function to obtain the text line image to be recognized. For example, after a user-triggered operation to turn on the shooting function is detected, a shooting interface can be displayed, and the text to be recognized can be shown in the shooting interface. If a shooting operation triggered by the user is detected, the image displayed on the shooting interface can be stored to obtain the image to be recognized.
  • the text line image to be recognized can also be obtained in other ways; for example, it can be selected from the storage space of the terminal device according to an operation triggered by the user. The way of obtaining the text line image to be recognized is not limited here.
  • S302 Input the image of the text line to be recognized into the trained language recognition model to obtain the language of the text to be recognized.
  • the language recognition model can be used to determine the language of the text to be recognized according to the text line image to be recognized. If the language is a language family that includes multiple languages, the language in the family that matches the linguistic pattern of the text to be recognized is used as the language of the text to be recognized.
  • the image of the text line to be recognized can be input into the trained language recognition model, so that the text to be recognized in the image can be recognized through the model and its language determined; after the language is determined, the text corresponding to each character in the text to be recognized can be accurately recognized according to the recognized language.
  • the terminal device can run the language recognition model through a preset central processing unit or a dedicated neural computing unit, input the image of the text line to be recognized into the language recognition model, and use the neural network in the model to detect and analyze the text to be recognized in the image so as to determine its language.
  • the terminal device can display the recognized language on the display screen.
  • the image of the text line to be recognized can be displayed on the display screen of the mobile terminal, and in the text line image to be recognized, the text to be recognized can be identified by means of frame selection, and at the same time, the language of the text to be recognized can be displayed near the frame selection area.
  • the language recognition method obtains an image of the text line to be recognized that includes the text to be recognized, inputs it into the trained language recognition model, and the model determines the language of the text to be recognized. If the language is a language family that includes multiple languages, the model can use the language in the family that matches the linguistic pattern of the text as the language of the text to be recognized. Recognizing the language based on linguistic patterns avoids inaccurate recognition caused by identical or similar characters shared across languages, and improves the accuracy of language recognition.
  • FIG. 4 is a schematic flowchart of another language recognition method provided by the present application. As an example and not a limitation, the method may be applied to the above-mentioned terminal device 110. As shown in FIG. 4, the method may include:
  • S402 Input the line image of the text to be recognized into the language classification network to obtain feature information of the text to be recognized.
  • the feature information is used to indicate the language of the text to be recognized.
  • if the feature information of the text to be recognized consists of letters, the language of the text to be recognized may be a language of the Latin family, which includes English, German, Italian, French, and other languages.
  • if the feature information of the text to be recognized consists of Chinese square characters or Japanese katakana, the language of the text to be recognized is Chinese or Japanese, respectively.
  • the pre-trained language recognition model can include a language classification network and a convolutional network, which play different roles in the process of language recognition. After the image of the text line to be recognized is obtained, it can first be input into the language classification network to determine the language of the text to be recognized.
  • the image of the text line to be recognized can be input into the language classification network, which denoises the image and performs feature extraction, and the feature information of the text to be recognized can be obtained based on the extracted features.
  • the recognized language can be used as the language of the text to be recognized.
  • S403 can be performed to perform further recognition through a convolutional network, and determine the language of the text to be recognized from the multiple languages included in the language family.
  • The language classification network of the language recognition model can be used to identify the language of each character in the text to be recognized, and the most frequent language is taken as the language of the text.
  • That is, the language classification network can count the language of each character and determine the language that occurs most often in the text to be recognized — the language that accounts for the largest proportion of the characters — and take that language as the language of the text.
  • When the text to be recognized mixes characters from several languages, the language with the largest proportion among the characters' languages can likewise be taken as the language of the text in the above manner.
  • For example, if the text to be recognized is "the English word corresponding to porcelain is china", where 10 characters of the text are Chinese and 5 characters ("china") belong to the Latin family, the language of the text to be recognized can be determined to be Chinese.
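The majority-vote rule described above can be sketched as follows. The per-character labels (`"zh"`, `"latin"`) are hypothetical placeholders for whatever labels the classification network emits, not names taken from the patent:

```python
from collections import Counter

def majority_language(char_langs):
    """Return the language that accounts for the largest share of characters.

    char_langs: list of per-character language labels, e.g. as produced by
    a per-character language classifier (hypothetical labels here).
    """
    counts = Counter(char_langs)
    lang, _ = counts.most_common(1)[0]
    return lang

# The example from the text: 10 Chinese characters and 5 Latin-family
# characters, so the whole line is judged Chinese.
labels = ["zh"] * 10 + ["latin"] * 5
print(majority_language(labels))  # -> zh
```

Ties are broken arbitrarily here; a real system would need an explicit tie-breaking rule.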
  • S403: If the language is a language family that includes multiple languages, input the feature information into the convolutional network to determine the linguistic law of the text to be recognized, and select from the family the language matching that law as the language of the text to be recognized.
  • If the recognized language is a language family, the feature information output by the language classification network can be further processed by the convolutional network of the language recognition model to determine the linguistic law of the text to be recognized, and the language of the text can then be determined according to that law.
  • Specifically, the feature information output by the language classification network can be input into the convolutional network.
  • The convolutional network studies each character together with the order of its adjacent characters to obtain the linguistic law of the text to be recognized; then, according to the language corresponding to that law, the language matching the law is selected from the multiple languages included in the family as the language of the text to be recognized.
  • For example, the feature information output by the language classification network may correspond to the characters "m", "y", "n", "a", "m", "e", "i", "s", "Z", "h", "a", "n", "g", "S", "a" and "n", each of which may belong to a Latin-family language.
  • The convolutional network can convolve over these characters, recognize the words "my", "name" and "is", and, combining the order of the words, determine that the language of the text to be recognized is English within the Latin family.
  • the language classification network of the language recognition model in the embodiment of the present application may be a Fully Convolutional Network (FCN), and the convolutional network may be a one-dimensional convolutional network.
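As an illustrative sketch only: the patent's one-dimensional convolutional network learns its filters during training, whereas below a fixed set of characteristic trigrams stands in for the learned convolution responses, just to show how character order can pick one language out of a family for the "my name is Zhang San" example. The trigram sets are invented for this sketch:

```python
# Hypothetical trigram "filters" per candidate language in the family.
FAMILY_NGRAMS = {
    "en": {"the", "and", "ing", " is"},
    "de": {"der", "sch", "ung", "ein"},
}

def pick_from_family(text, family=FAMILY_NGRAMS):
    """Score each candidate language by matching character trigrams
    (a stand-in for 1-D convolution responses) and return the best."""
    text = text.lower()
    trigrams = {text[i:i + 3] for i in range(len(text) - 2)}
    scores = {lang: len(trigrams & grams) for lang, grams in family.items()}
    return max(scores, key=scores.get)

print(pick_from_family("my name is Zhang San"))  # -> en
```

The point of the learned network is exactly this kind of scoring, but with filters optimized from data rather than hand-picked n-grams.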
  • The language recognition method thus acquires a text line image containing the text to be recognized and inputs it into the trained language recognition model, which determines the language of the text from the image. If the language is a language family including multiple languages, the model can take the language in the family corresponding to the linguistic law of the text as the language of the text to be recognized. Recognizing the language based on its linguistic law avoids inaccurate recognition caused by the ambiguity of identical or similar characters and improves the accuracy of language recognition.
  • In addition, a language classification network composed of a fully convolutional network can recognize the text line image quickly and make full use of the line-sequence information of the image, reducing the time spent on language recognition and improving its accuracy.
  • A convolutional network composed of a one-dimensional convolutional network can learn the linguistic law of the text to be recognized, so that the language can be selected from the multiple languages included in the family according to the learned law, avoiding inaccurate recognition caused by identical or similar characters and improving the accuracy of language recognition.
  • The foregoing embodiments are implemented based on the language recognition model in the terminal device, and the model can be obtained by training on a large number of sample text line images. FIG. 5 is a schematic flowchart of a method for training a language recognition model provided by the present application.
  • As an example and not a limitation, the method may be applied to the above-mentioned terminal device 110 or to a server linked to the terminal device 110.
  • The method may include:
  • S501: Acquire a historical sample set. The historical sample set may include a large number of sample text line images, and each sample text line image may correspond to a text identifier that indicates the sample text in the image.
  • For example, if a sample text line image includes Chinese text, its text identifier can indicate each Chinese character in the image; if a sample text line image includes English text, its text identifier can indicate each English character in the image.
  • S502 Convert each text identifier into a language code according to a preset code table, to obtain a sample set consisting of a sample text line image and a language code corresponding to each sample text line image.
  • the code table may include multiple languages, and each character in each language corresponds to at least one language code.
  • For languages such as Chinese, Japanese, and Korean, in which each language includes a large number of characters that differ considerably from the characters of other languages, each language can be encoded separately: the code table records the codes cn, ja, and ko corresponding to Chinese, Japanese, and Korean, with each code covering every character of the corresponding language.
  • For languages in which the number of characters is relatively small but similar to the characters of other languages, multiple languages such as English, German, and French can be treated as a language family, and the characters of these languages can be encoded uniformly.
  • That is, identical characters shared by different languages can correspond to a single language code, yielding the code table shown in Table 2, which records the codes of the characters shared between, and distinct to, Russian and the Latin languages in the family.
  • In this way, the text identifier corresponding to each sample text line image in the historical sample set can be converted in S502 according to the preset code table, so that the language code corresponding to each character of the sample text is obtained, and a sample set composed of the sample text line images and the corresponding language codes is then generated.
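A minimal sketch of the S502 conversion, with a hypothetical mini code table — the real table would cover far more characters and languages, and the codes shown (`cn`, `latin`, etc.) are placeholders:

```python
# Hypothetical mini code table: each character maps to at least one language
# code. Characters shared across a family (here, Latin letters) map to the
# family code, while distinctive characters map to a single language code.
CODE_TABLE = {
    "瓷": ["cn"], "器": ["cn"],
    "の": ["ja"], "한": ["ko"],
    **{c: ["latin"] for c in "abcdefghijklmnopqrstuvwxyz"},
}

def encode_text(text, table=CODE_TABLE):
    """Convert a text identifier (character sequence) into language codes,
    skipping characters missing from the table (e.g. spaces)."""
    return [table[c][0] for c in text if c in table]

print(encode_text("瓷器 china"))
# -> ['cn', 'cn', 'latin', 'latin', 'latin', 'latin', 'latin']
```

The resulting code sequence, paired with the sample text line image, forms one training example of the sample set.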
  • S503 Input the sample text line image in the sample set into the initial language classification network of the initial language recognition model to obtain sample feature information of the sample text in the sample text line image.
  • the sample feature information is used to indicate the language of the sample text.
  • If the recognized language of the sample text is not a language family, the recognized result includes only one language, and that language can be determined as the language of the sample text without further confirmation by the initial convolutional network.
  • In this case, the first loss value between the determined language of the sample text and the actual language indicated by the language code corresponding to the sample text line image can be calculated with the preset first loss function, so that subsequent steps can decide from this value whether the initial language recognition model needs to be trained again. That is, after the first loss value is calculated, S507 can be executed to determine whether training of the initial language recognition model must continue.
  • Optionally, the first loss value between the language of the sample text and the actual language of the sample text can be calculated with the Connectionist Temporal Classification loss function (CTCLoss).
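For reference, a minimal, unvectorized sketch of the CTC forward algorithm that CTCLoss is based on; real training would use a framework implementation, and the probabilities and labels below are made up for illustration:

```python
import math

def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under the CTC forward algorithm.

    probs: list of per-timestep probability distributions (T x C).
    target: list of label indices (without blanks).
    """
    ext = [blank]
    for t in target:
        ext += [t, blank]          # interleave blanks: extended sequence l'
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s > 0:
                a += alpha[t - 1][s - 1]
            # skipping over a blank is allowed only between different labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    likelihood = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return -math.log(likelihood)

# Two timesteps, two classes (0 = blank, 1 = label), target "1":
# the paths 1-1, blank-1 and 1-blank all collapse to "1".
print(ctc_loss([[0.2, 0.8], [0.3, 0.7]], [1]))  # ≈ 0.062
```

CTC is a natural fit here because the per-character language codes form a sequence whose alignment to the image columns is unknown.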
  • If the language of the sample text is a language family, S505 can be executed to further recognize the language of the sample text through the initial convolutional network.
  • S505 If the language of the sample text is a language family, input the sample feature information into the initial convolutional network of the initial language recognition model, and select a language that matches the language law of the sample text from the language family as the language of the sample text.
  • S506 Calculate a second loss value between the language type of the sample text and the actual language type of the sample text according to the preset second loss function.
  • the second loss function may be a normalized exponential loss function (SoftMaxLoss), and the second loss value between the language of the sample text and the actual language of the sample text can be calculated according to the normalized exponential loss function.
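For one sample, the normalized exponential (softmax) cross-entropy loss can be written in the numerically stable form below; this is a generic sketch of the loss, not code from the patent:

```python
import math

def softmax_loss(logits, target_index):
    """Cross-entropy of a softmax distribution against the true class.

    Subtracting the max logit before exponentiating avoids overflow
    without changing the result.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target_index] / sum(exps))

print(round(softmax_loss([0.0, 0.0], 0), 4))  # -> 0.6931 (= ln 2)
```

The loss shrinks toward zero as the logit of the correct language dominates the other candidates in the family.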
  • After the first loss value or the second loss value is calculated, it can be judged whether that loss value satisfies the preset condition. If it does not, the initial language recognition model has not converged and needs to be trained again until the preset condition is satisfied.
  • Specifically, the first loss value or the second loss value can be compared with a preset loss threshold matching the first or second loss function, to determine whether the loss value is less than or equal to the corresponding threshold.
  • If the first loss value or the second loss value does not satisfy the preset condition, the model parameters of the initial language recognition model can be adjusted according to that loss value, and S503, S504 and S507, or S503 and S505 to S507, can be executed again; that is, the sample text line images are input into the initial language recognition model with adjusted parameters, and the model is adjusted and trained according to the newly calculated first or second loss value until both the first loss value and the second loss value satisfy the preset condition.
  • The initial language recognition model for which both loss values satisfy the preset condition is taken as the language recognition model.
  • For example, after a first loss value satisfies the preset condition, S503 and S505 to S507 can be executed again; if the second loss value calculated this time also satisfies the preset condition, training of the initial language recognition model can be stopped and the current initial language recognition model used as the language recognition model.
  • Alternatively, after a second loss value satisfies the preset condition, a first loss value can be obtained again; if every first loss value calculated before that second loss value satisfied the preset condition, it can be determined that both the first loss value and the second loss value satisfy the preset condition.
  • S503, S504 and S507 can likewise be executed again to check whether the recalculated first loss value satisfies the preset condition; if it does, training of the initial language recognition model stops, and the current initial language recognition model is used as the language recognition model.
  • Only when every second loss value obtained satisfies the preset condition can it be determined that both loss values satisfy the condition; if any second loss value calculated before the first loss value is determined to satisfy the condition fails to satisfy it, it cannot yet be concluded that both the first loss value and the second loss value satisfy the preset condition.
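The stopping rule described above can be condensed into a small control-flow sketch. `model`, `step`, and `update` are hypothetical stand-ins for the training machinery, and a single threshold replaces the per-loss preset conditions for brevity:

```python
def train(model, samples, threshold, max_steps=1000):
    """Keep adjusting parameters until both the first loss (non-family
    samples) and the second loss (family samples) are at or below the
    preset threshold. `model.step` returns the two losses for one pass;
    `model.update` adjusts parameters from them (both hypothetical)."""
    for _ in range(max_steps):
        loss1, loss2 = model.step(samples)
        if loss1 <= threshold and loss2 <= threshold:
            return model          # both preset conditions met: stop training
        model.update(loss1, loss2)
    return model
```

In the patent's flow, S503–S507 alternate between the two branches sample by sample; the point captured here is simply that training stops only when both loss values satisfy their preset conditions.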
  • The method for training a language recognition model provided by the embodiments of the present application acquires a historical sample set, converts the text identifiers in it according to a preset code table to obtain a sample set, and then trains the initial language recognition model with the sample set until both the first loss value and the second loss value satisfy the preset condition, yielding the language recognition model.
  • When the language recognition model recognizes the text in a text line image, it can recognize the language based on the linguistic law of the text.
  • This avoids inaccurate recognition caused by the ambiguity of identical or similar characters and improves the accuracy of language recognition.
  • Moreover, the sample set used to train the initial language recognition model is obtained from the historical sample set, without generating new sample data, which reduces the cost of training the initial language recognition model.
  • FIG. 6 is a structural block diagram of a language recognition device provided in an embodiment of the present application. For ease of description, only the parts related to the embodiment of the present application are shown.
  • the device includes:
  • the image acquisition module 601 is configured to acquire a text line image to be recognized, and the text line image to be recognized includes the text to be recognized;
  • the recognition module 602 is used to input the image of the text line to be recognized into a trained language recognition model to obtain the language of the text to be recognized.
  • The language recognition model is used to determine the language of the text to be recognized based on the text line image; if the language is a language family including multiple languages, the language in the family corresponding to the linguistic law of the text to be recognized is taken, according to that law, as the language of the text.
  • the language recognition model includes a language classification network and a convolutional network
  • The recognition module 602 is also used to input the text line image to be recognized into the language classification network to obtain feature information of the text to be recognized.
  • The feature information is used to indicate the language of the text to be recognized; if the language is a language family including multiple languages, the feature information is input into the convolutional network, the linguistic law of the text is determined, and the language matching that law is selected from the family as the language of the text to be recognized.
  • the device further includes:
  • the first training module 603 is used to input the sample text line image in the sample set into the initial language classification network of the initial language recognition model to obtain sample feature information of the sample text in the sample text line image, and the sample feature information is used to indicate the The language of the sample text;
  • the first calculation module 604 is configured to calculate a first loss value between the language of the sample text and the actual language of the sample text according to a preset first loss function if the language of the sample text is not a language family;
  • The second training module 605 is used to input the sample feature information into the initial convolutional network of the initial language recognition model if the language of the sample text is a language family, and to select from the family the language matching the linguistic law of the sample text as the language of the sample text;
  • the second calculation module 606 is configured to calculate a second loss value between the language type of the sample text and the actual language type of the sample text according to a preset second loss function;
  • The adjustment module 607 is configured to adjust the model parameters of the initial language recognition model when the first loss value or the second loss value does not satisfy the preset condition, and to return to the step of inputting the sample text line images in the sample set into the initial language classification network of the initial language recognition model to obtain the sample feature information of the sample text, and the subsequent steps;
  • The determining module 608 is configured to stop training the initial language recognition model when both the first loss value and the second loss value satisfy the preset condition, and to take the initial language recognition model for which both loss values satisfy the preset condition as the language recognition model.
  • the device further includes:
  • the sample acquisition module 609 is configured to acquire a historical sample set, the historical sample set includes a sample text line image and a text identifier corresponding to each sample text line image;
  • The sample generation module 610 is configured to convert each text identifier into a language code according to a preset code table to obtain a sample set composed of the sample text line images and the language code corresponding to each sample text line image, where the code table includes multiple languages and each character in each language corresponds to at least one language code.
  • the first calculation module 604 is further configured to calculate the first loss value between the language of the sample text and the actual language of the sample text according to the continuous time series classification loss function;
  • the second calculation module 606 is further configured to calculate a second loss value between the language of the sample text and the actual language of the sample text according to the normalized exponential loss function.
  • the language classification network of the language recognition model is used to identify the language of each character in the text to be recognized, and the language with the largest number is used as the language of the text to be recognized.
  • The language recognition device acquires a text line image containing the text to be recognized and inputs it into the trained language recognition model, which determines the language of the text from the image. If the language is a language family including multiple languages, the model can then take the language in the family corresponding to the linguistic law of the text as the language of the text to be recognized. Recognizing the language based on its linguistic law avoids inaccurate recognition caused by the ambiguity of identical or similar characters and improves the accuracy of language recognition.
  • An embodiment of the present application also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor. When the processor executes the computer program, the language recognition method described in any one of the embodiments corresponding to FIG. 3 to FIG. 5 is implemented.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the language recognition method described in any one of the embodiments corresponding to FIG. 3 to FIG. 5 is implemented.
  • the disclosed device and method may be implemented in other ways.
  • The system embodiment described above is merely illustrative.
  • The division of the modules or units is only a logical function division; in actual implementation there may be other division methods.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, all or part of the processes of the foregoing method embodiments can also be completed by instructing the relevant hardware through a computer program, and the computer program can be stored in a computer-readable storage medium.
  • When the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • The computer-readable medium may at least include: any entity or device capable of carrying the computer program code to a terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a mobile hard disk, a floppy disk, or a CD-ROM.
  • In some jurisdictions, according to legislation and patent practice, computer-readable media cannot include electric carrier signals and telecommunications signals.


Abstract

This application is applicable to the field of on-device artificial intelligence and the corresponding field of computer vision technology, and provides a language recognition method, apparatus, terminal device, and computer-readable storage medium. The method includes: acquiring a text line image to be recognized, the text line image including the text to be recognized; and inputting the text line image into a trained language recognition model to obtain the language of the text to be recognized, where the language recognition model is used to determine the language of the text based on the text line image, and, if the language is a language family including multiple languages, to take the language in the family corresponding to the linguistic law of the text, according to that law, as the language of the text. Recognizing the language based on its linguistic law avoids inaccurate recognition caused by the ambiguity of identical or similar characters and improves the accuracy of language recognition.

Description

Language recognition method, apparatus, terminal device, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office on November 22, 2019, with application number 2019111583571 and titled "Language recognition method, apparatus, terminal device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the technical fields of artificial intelligence (AI) and computer vision, and in particular relates to a language recognition method, apparatus, terminal device, and computer-readable storage medium.
Background
With the continuous development of text recognition technology, not only Chinese text but also text in other languages can be recognized. To improve the accuracy of text recognition across different languages, the language of the text to be recognized can be identified first.
In the related art, a text line image can be sampled with a sliding window to obtain multiple image patches, which are input into a convolutional neural network for recognition to obtain the language corresponding to each patch; the language recognized most often is then taken as the language of the text line image.
However, for languages that use identical or similar characters, this approach produces a large number of ambiguous images, which reduces the accuracy of language recognition.
Summary
Embodiments of this application provide a language recognition method, apparatus, terminal device, and computer-readable storage medium, which can improve the accuracy of language recognition in text line images.
In a first aspect, an embodiment of this application provides a language recognition method, including:
acquiring a text line image to be recognized, the text line image including the text to be recognized;
inputting the text line image to be recognized into a trained language recognition model to obtain the language of the text to be recognized, the language recognition model being used to determine the language of the text based on the text line image, and, if the language is a language family including multiple languages, to take the language in the family corresponding to the linguistic law of the text, according to that law, as the language of the text.
In a first possible implementation of the first aspect, the language recognition model includes a language classification network and a convolutional network;
inputting the text line image to be recognized into the trained language recognition model to obtain the language of the text to be recognized includes:
inputting the text line image into the language classification network to obtain feature information of the text to be recognized, the feature information being used to indicate the language of the text;
if the language is a language family including multiple languages, inputting the feature information into the convolutional network, determining the linguistic law of the text, and selecting from the family the language matching that law as the language of the text.
In a second possible implementation of the first aspect, before inputting the text line image to be recognized into the trained language recognition model, the method further includes:
inputting the sample text line images in a sample set into an initial language classification network of an initial language recognition model to obtain sample feature information of the sample text in each sample text line image, the sample feature information being used to indicate the language of the sample text;
if the language of the sample text is not a language family, calculating a first loss value between the language of the sample text and the actual language of the sample text according to a preset first loss function;
if the language of the sample text is a language family, inputting the sample feature information into an initial convolutional network of the initial language recognition model, and selecting from the family the language matching the linguistic law of the sample text as the language of the sample text;
calculating a second loss value between the language of the sample text and the actual language of the sample text according to a preset second loss function;
when the first loss value or the second loss value does not satisfy a preset condition, adjusting the model parameters of the initial language recognition model, and returning to the step of inputting the sample text line images in the sample set into the initial language classification network of the initial language recognition model to obtain the sample feature information of the sample text, and the subsequent steps;
when both the first loss value and the second loss value satisfy the preset condition, stopping training the initial language recognition model, and taking the initial language recognition model at the time both loss values satisfy the preset condition as the language recognition model.
Based on the second possible implementation of the first aspect, in a third possible implementation, before inputting the sample text line images in the sample set into the initial language classification network of the initial language recognition model, the method further includes:
acquiring a historical sample set, the historical sample set including the sample text line images and a text identifier corresponding to each sample text line image;
converting each text identifier into a language code according to a preset code table to obtain a sample set composed of the sample text line images and the language code corresponding to each sample text line image, the code table including multiple languages, with each character in each language corresponding to at least one language code.
Based on the second possible implementation of the first aspect, in a fourth possible implementation, calculating the first loss value between the language of the sample text and the actual language of the sample text according to the preset first loss function includes:
calculating the first loss value between the language of the sample text and the actual language of the sample text according to a Connectionist Temporal Classification loss function;
correspondingly, calculating the second loss value between the language of the sample text and the actual language of the sample text according to the preset second loss function includes:
calculating the second loss value between the language of the sample text and the actual language of the sample text according to a normalized exponential (softmax) loss function.
In a fifth possible implementation of the first aspect, the language classification network of the language recognition model is used to identify the language of each character in the text to be recognized and take the most frequent language as the language of the text.
In a second aspect, an embodiment of this application provides a language recognition apparatus, including:
an image acquisition module, configured to acquire a text line image to be recognized, the text line image including the text to be recognized;
a recognition module, configured to input the text line image to be recognized into a trained language recognition model to obtain the language of the text to be recognized, the language recognition model being used to determine the language of the text based on the text line image, and, if the language is a language family including multiple languages, to take the language in the family corresponding to the linguistic law of the text, according to that law, as the language of the text.
In a first possible implementation of the second aspect, the language recognition model includes a language classification network and a convolutional network;
the recognition module is further configured to input the text line image into the language classification network to obtain feature information of the text to be recognized, the feature information being used to indicate the language of the text; and, if the language is a language family including multiple languages, to input the feature information into the convolutional network, determine the linguistic law of the text, and select from the family the language matching that law as the language of the text.
In a second possible implementation of the second aspect, the apparatus further includes:
a first training module, configured to input the sample text line images in a sample set into an initial language classification network of an initial language recognition model to obtain sample feature information of the sample text in each sample text line image, the sample feature information being used to indicate the language of the sample text;
a first calculation module, configured to calculate, if the language of the sample text is not a language family, a first loss value between the language of the sample text and the actual language of the sample text according to a preset first loss function;
a second training module, configured to input, if the language of the sample text is a language family, the sample feature information into an initial convolutional network of the initial language recognition model, and select from the family the language matching the linguistic law of the sample text as the language of the sample text;
a second calculation module, configured to calculate a second loss value between the language of the sample text and the actual language of the sample text according to a preset second loss function;
an adjustment module, configured to adjust the model parameters of the initial language recognition model when the first loss value or the second loss value does not satisfy a preset condition, and return to the step of inputting the sample text line images in the sample set into the initial language classification network of the initial language recognition model to obtain the sample feature information of the sample text, and the subsequent steps;
a determining module, configured to stop training the initial language recognition model when both the first loss value and the second loss value satisfy the preset condition, and take the initial language recognition model at the time both loss values satisfy the preset condition as the language recognition model.
Based on the second possible implementation of the second aspect, in a third possible implementation, the apparatus further includes:
a sample acquisition module, configured to acquire a historical sample set, the historical sample set including sample text line images and a text identifier corresponding to each sample text line image;
a sample generation module, configured to convert each text identifier into a language code according to a preset code table to obtain a sample set composed of the sample text line images and the language code corresponding to each sample text line image, the code table including multiple languages, with each character in each language corresponding to at least one language code.
Based on the second possible implementation of the second aspect, in a fourth possible implementation, the first calculation module is further configured to calculate the first loss value between the language of the sample text and the actual language of the sample text according to a Connectionist Temporal Classification loss function;
correspondingly, the second calculation module is further configured to calculate the second loss value between the language of the sample text and the actual language of the sample text according to a normalized exponential (softmax) loss function.
In a fifth possible implementation of the second aspect, the language classification network of the language recognition model is used to identify the language of each character in the text to be recognized and take the most frequent language as the language of the text.
In a third aspect, an embodiment of this application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the language recognition method according to any one of the implementations of the first aspect.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the language recognition method according to any one of the implementations of the first aspect.
In a fifth aspect, an embodiment of this application provides a computer program product which, when run on a terminal device, causes the terminal device to perform the language recognition method according to any one of the implementations of the first aspect.
Compared with the prior art, the embodiments of this application have the following beneficial effects:
In the embodiments of this application, a text line image including the text to be recognized is acquired and input into a trained language recognition model, which determines the language of the text; if the language is a language family including multiple languages, the model can further take the language in the family corresponding to the linguistic law of the text as the language of the text. Recognizing the language based on its linguistic law avoids inaccurate recognition caused by the ambiguity of identical or similar characters and improves the accuracy of language recognition.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a scenario involved in the language recognition method provided by this application;
FIG. 2 is a block diagram of a partial structure of a mobile phone provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of a language recognition method provided by this application;
FIG. 4 is a schematic flowchart of another language recognition method provided by this application;
FIG. 5 is a schematic flowchart of a method for training a language recognition model provided by this application;
FIG. 6 is a structural block diagram of a language recognition apparatus provided by an embodiment of this application;
FIG. 7 is a structural block diagram of another language recognition apparatus provided by an embodiment of this application;
FIG. 8 is a structural block diagram of yet another language recognition apparatus provided by an embodiment of this application.
Detailed Description
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and techniques are set forth in order to provide a thorough understanding of the embodiments of this application. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of this application.
The terms used in the following embodiments are only for the purpose of describing specific embodiments and are not intended to limit this application. As used in the specification and the appended claims, the singular expressions "a", "an", "the", "the above", "said", and "this" are intended to also cover expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the embodiments of this application, "one or more" means one, two, or more than two; "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may be singular or plural.
The language recognition method provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA); the embodiments of this application place no restriction on the specific type of the terminal device.
For example, the terminal device may be a station (ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, customer premise equipment (CPE), and/or other devices for communicating over a wireless system, as well as a device in a next-generation communication system, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN).
As an example and not a limitation, when the terminal device is a wearable device, the wearable device may be a general term for wearable equipment developed by applying wearable technology to the intelligent design of everyday wear, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device worn directly on the body or integrated into the user's clothes or accessories. Wearable devices are not merely hardware devices; they achieve powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include devices that are full-featured and large-sized and can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses.
FIG. 1 is a schematic diagram of a scenario involved in the language recognition method provided by this application. As shown in FIG. 1, the scenario includes a terminal device 110 and an object to be photographed 120.
The object to be photographed 120 may include the text to be recognized, and the terminal device 110 can photograph the object 120 to obtain a text line image including the text to be recognized.
Moreover, during operation the terminal device 110 can run a pre-trained language recognition model, so that it can recognize the text to be recognized in the image through the model and thereby determine the language of the text.
In addition, the image to be recognized may not only be an image including the text to be recognized captured by the terminal device 110, but also an image including the text that is pre-stored on the terminal device 110, or an image obtained through wireless transmission; the embodiments of this application place no restriction on how the image to be recognized is obtained.
In a possible implementation, the terminal device 110 can obtain the image to be recognized, input it into the pre-trained language recognition model, and recognize each character of the text to be recognized through the model, so that the language of the text can be determined from the languages corresponding to the characters.
It should be noted that the language recognition model may include a language classification network and a convolutional network. The language classification network is used to recognize the language of the text to be recognized; however, when that language is a language family including multiple languages, the classification network cannot determine which language in the family the text belongs to. In that case, the convolutional network can learn the linguistic law of the text and determine the language of the text according to that law.
In addition, the terminal device 110 in the embodiments of this application may be a terminal device in the field of on-device artificial intelligence, applied in the field of computer technology. The terminal device 110 can recognize text in a scene and determine both the language of the text and its content. For example, the terminal device 110 can recognize an English sentence in a scene, determine that the text is English, and translate it to obtain the corresponding Chinese sentence.
以所述终端设备110为手机为例。图2是本申请实施例提供的手机的部分结构的框图。参考图2,手机包括:射频(Radio Frequency,RF)电路210、存储器220、输入单元230、显示单元240、传感器250、音频电路260、无线保真(wireless fidelity,WiFi)模块270、处理器280、以及电源290等部件。本领域技术人员可以理解,图2中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图2对手机的各个构成部件进行具体的介绍:
RF电路210可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器280处理;另外,将设计上行的数据发送给基站。通常,RF电路包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外,RF电路210还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯***(Global System of Mobile communication,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE))、电子邮件、短消息服务(Short Messaging Service,SMS)等。
存储器220可用于存储软件程序以及模块,处理器280通过运行存储在存储器220的软件程序以及模块,从而执行手机的各种功能应用以及数据处理。存储器220可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作***、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器220可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
输入单元230可用于接收输入的数字或字符信息,以及产生与手机的用户设置以及功能控制有关的键信号输入。具体地,输入单元230可包括触控面板231以及其他输入设备232。触控面板231,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板231上或在触控面板231附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触控面板231可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器280,并能接收处理器280发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板231。除了触控面板231,输入单元230还可以包括其他输入设备232。具体地,其他输入设备232可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元240可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元240可包括显示面板241,可选的,可以采用液晶显示器(Liquid Crystal Display,LCD)、有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板241。进一步的,触控面板231可覆盖显示面板241,当触控面板231检测到在其上或附近的触摸操作后,传送给处理器280以确定触摸事件的类型,随后处理器280根据触摸事件的类型在显示面板241上提供相应的视觉输出。虽然在图2中,触控面板231与显示面板241是作为两个独立的部件来实现手机的输入和输出功能,但是在某些实施例中,可以将触控面板231与显示面板241集成而实现手机的输入和输出功能。
处理器280是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器220内的软件程序和/或模块,以及调用存储在存储器220内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器280可包括一个或多个处理单元;优选的,处理器280可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器280中。
手机还包括给各个部件供电的电源290(比如电池),优选的,电源可以通过电源管理系统与处理器280逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
尽管未示出,手机还可以包括摄像头。可选地,摄像头在手机的上的位置可以为前置的,也可以为后置的,本申请实施例对此不作限定。
可选地,手机可以包括单摄像头、双摄像头或三摄像头等,本申请实施例对此不作限定。
例如,手机可以包括三摄像头,其中,一个为主摄像头、一个为广角摄像头、一个为长焦摄像头。
可选地,当手机包括多个摄像头时,这多个摄像头可以全部前置,或者全部后置,或者一部分前置、另一部分后置,本申请实施例对此不作限定。
另外,尽管未示出,手机还可以包括蓝牙模块等,在此不再赘述。
图3是本申请提供的一种语种识别方法的示意性流程图,作为示例而非限定,该方法可以应用于上述终端设备110中,如图3所示,该方法可以包括:
S301、获取待识别文本行图像,该待识别文本行图像包括待识别文本。
终端设备可以获取包括待识别文本的图像,得到待识别文本行图像,并对待识别文本行图像中的待识别文本进行检测,从而可以根据待识别文本中各个字符所属的语种,确定待识别文本的语种。
在一种可能的实现方式中,终端设备可以根据预先设置的拍摄功能对待识别文本进行拍摄,得到包括待识别文本的待识别文本行图像。例如,在检测到用户触发的开启拍摄功能的操作后,可以显示拍摄界面,并在该拍摄界面中显示拍摄的待识别文本,若检测到用户触发的拍摄操作,则可以对拍摄界面显示的图像进行存储,得到待识别文本行图像。
当然,还可以通过其他方式获取待识别文本行图像,例如,可以根据用户触发的操作,从终端设备的存储空间中,选取待识别文本行图像,本申请实施例对获取待识别文本行图像的方式不做限定。
S302、将待识别文本行图像输入训练后的语种识别模型,得到待识别文本的语种。
其中,该语种识别模型可以用于根据待识别文本行图像,确定待识别文本的语种,若该语种为包括多种语种的语系语种,则根据待识别文本的语言规律,将语系语种中与语言规律对应的语种作为待识别文本的语种。
在得到待识别文本行图像之后,即可将该待识别文本行图像输入训练后的语种识别模型,从而通过该语种识别模型对待识别文本行图像中的待识别文本进行识别,确定待识别文本的语种,以便在确定待识别文本的语种后,可以根据识别的语种,准确的识别待识别文本中各个字符所对应的文字。
在一种可能的实现方式中,终端设备在获取待识别文本行图像后,可以通过预设的中央处理器或专用神经运算单元,运行该语种识别模型,并将该待识别文本行图像输入语种识别模型,通过语种识别模型中的神经网络对待识别文本行图像中的待识别文本进行检测分析,确定待识别文本的语种。
而且,在确定待识别文本的语种后,终端设备可以在显示屏上显示识别得到的语种。例如,可以在移动终端的显示屏上显示待识别文本行图像,并在待识别文本行图像中,通过框选等方式标识待识别文本,同时,在框选区域附近显示待识别文本的语种。
综上所述,本申请实施例提供的语种识别方法,通过获取包括待识别文本的待识别文本行图像,并将待识别文本行图像输入训练后的语种识别模型,通过该语种识别模型确定待识别文本的语种,若该语种为包括多种语种的语系语种,则语种识别模型可以再根据待识别文本的语言规律,将语系语种中与语言规律对应的语种作为待识别文本的语种。基于语言规律识别语种,避免了相同或相似字符的二义性导致识别不准确的问题,提高了语种识别的准确率。
图4是本申请提供的另一种语种识别方法的示意性流程图,作为示例而非限定,该方法可以应用于上述终端设备110中,如图4所示,该方法可以包括:
S401、获取待识别文本行图像,该待识别文本行图像包括待识别文本。
S402、将待识别文本行图像输入语种分类网络,得到待识别文本的特征信息。
其中,该特征信息用于指示待识别文本的语种。例如,若待识别文本的特征信息为多个字母,则表示待识别文本的语种可以为包括英德意法等多个语种的拉丁语系的语系语种。但是,若待识别文本的特征信息为中文语种中的各个方块字,或者日文语种中的片假名,则表示待识别文本的语种为中文语种或日文语种。
预先训练的语种识别模型可以包括语种分类网络和卷积网络,而语种分类网络和 卷积网络在语种识别的过程中,分别起到不同的作用,则在获取待识别文本行图像之后,可以先将待识别文本行图像输入语种分类网络,确定待识别文本的语种。
在一种可能的实现方式中,可以将待识别文本行图像输入语种分类网络,通过该语种分类网络对待识别文本行图像进行去噪和特征提取等操作,即可根据提取得到待识别文本行图像的特征信息,确定待识别文本中每个字符的语种,并根据每个字符的语种,确定待识别文本的语种。
若识别得到的语种不是语系语种,则可以将该识别得到的语种作为待识别文本的语种。但是,若识别得到的语种是包括多种语种的语系语种,则可以执行S403,通过卷积网络进行进一步地识别,从语系语种包括的多种语种内确定待识别文本的语种。
需要说明的是,语种识别模型的语种分类网络可以用于识别待识别文本中每个字符的语种,并将数量最多的语种作为待识别文本的语种。
相应的,在识别语种的过程中,语种分类网络在确定待识别文本中各个字符的语种后,可以对各个字符的语种进行统计,确定待识别文本中出现的数目最多的语种,也即是,在各个字符的语种中所占比例最大的语种,并将该语种作为待识别文本的语种。
另外,若待识别文本中包括多个语种对应的字符,也可以按照上述方式,将各个字符的语种中所占比例最大的语种,作为待识别文本的语种。
例如,若待识别文本为“瓷器对应的英文单词是china”,待识别文本中10个字符对应的语种为中文语种,5个字符对应的语种为拉丁语系的语系语种,则可以确定该待识别文本的语种为中文语种。
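作为示例而非限定,上述"将各字符的语种中所占比例最大的语种作为待识别文本的语种"的统计逻辑,可以用如下极简示意说明(其中逐字符的语种标签假设为语种分类网络已经输出的结果,并非实际模型代码):

```python
from collections import Counter

def majority_language(char_langs):
    """统计各字符的语种标签,返回出现次数最多(即占比最大)的语种。"""
    counts = Counter(char_langs)
    return counts.most_common(1)[0][0]

# 示意:"瓷器对应的英文单词是china"中,10个字符被标记为中文(cn),
# 5个字符被标记为拉丁语系(latin),则整行文本判定为中文。
labels = ["cn"] * 10 + ["latin"] * 5
print(majority_language(labels))  # cn
```

当多个语种数目并列时,上述示意按统计顺序取其一;实际实现中可结合置信度等信息进一步裁决。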
S403、若语种为包括多种语种的语系语种,将特征信息输入卷积网络,确定待识别文本的语言规律,并从语系语种中选取与语言规律相匹配的语种作为待识别文本的语种。
若识别得到的语种为包括多种语种的语系语种,则可以通过语种识别模型的卷积网络对语种分类网络输出的特征信息进行进一步识别,确定待识别文本的语言规律,从而可以根据该语言规律确定待识别文本的语种。
在一种可能的实现方式中,可以将语种分类网络输出的特征信息输入卷积网络中,对于特征信息所指示的每个字符,可以通过卷积网络对该字符和与该字符相邻的其他字符的时序进行学习,得到待识别文本的语言规律,再根据该语言规律对应的语种,从待识别文本的语系语种所包括的多种语种中,选取与该语言规律相匹配的语种作为待识别文本的语种。
例如,若待识别文本为“my name is Zhang San”,则语种分类网络输出的特征信息可以为“m”、“y”、“n”、“a”、“m”、“e”、“i”、“s”、“Z”、“h”、“a”、“n”、“g”、“S”、“a”和“n”,则上述各个字符对应的语种可以为拉丁语系的语系语种。相应的,则可以通过卷积网络对上述各个字符进行卷积操作,识别得到单词“my”、“name”和“is”,并且结合各个单词的语序可以确定该待识别文本的语种为拉丁语系中的英语。
需要说明的是,在实际应用中,本申请实施例中语种识别模型的语种分类网络可以为全卷积网络(Fully Convolutional Networks,FCN),而卷积网络则可以为一维卷积网络。
综上所述,本申请实施例提供的语种识别方法,通过获取包括待识别文本的待识别文本行图像,并将待识别文本行图像输入训练后的语种识别模型,通过该语种识别模型确定待识别文本的语种,若该语种为包括多种语种的语系语种,则语种识别模型可以再根据待识别文本的语言规律,将语系语种中与语言规律对应的语种作为待识别文本的语种。基于语言规律识别语种,避免了相同或相似字符的二义性导致识别不准确的问题,提高了语种识别的准确率。
进一步地,由全卷积网络构成的语种分类网络,可以快速对待识别文本行图像进行识别,并且可以充分利用待识别文本行图像的行序列信息,减少了识别语种的所花费的时间,提高了识别语种的准确度。
进一步地,由一维卷积网络组成的卷积网络,可以对待识别文本的语言规律进行学习,从而可以根据学习的语言规律从语系语种包括的多个语种中,选取待识别文本的语种,避免了相同或相似字符导致无法准确识别语种的问题,提高了语种识别的准确度。
上述实施例是基于终端设备中的语种识别模型实现的,而语种识别模型可以根据大量的样本文本行图像进行训练得到,参见图5,图5是本申请提供的一种训练语种识别模型的方法的示意性流程图,作为示例而非限定,该方法可以应用于上述终端设备110,或与终端设备110链路连接的服务器中,该方法可以包括:
S501、获取历史样本集合,该历史样本集合包括样本文本行图像和与每个样本文本行图像对应的文本标识。
在训练语种识别模型的过程中,需要根据大量样本数据对建立的初始语种识别模型进行训练,而在获取样本数据的过程中,可以获取历史样本集合,并根据历史样本集合,生成与初始语种识别模型相匹配的样本集合。
其中,历史样本集合可以包括大量的样本文本行图像,而每个样本文本行图像可以对应有文本标识,该文本标识用于指示样本文本行图像中的样本文本。
例如,若样本文本行图像中包括中文文本,则该样本文本行图像对应的文本标识可以指示样本文本行图像中的各个中文字符;若样本文本行图像中包括英文文本,则该样本文本行图像对应的文本标识可以指示样本文本行图像中的各个英文字符。
S502、根据预设码表将各个文本标识转换为语种编码,得到由样本文本行图像和与每个样本文本行图像对应的语种编码所组成的样本集合。
其中,该码表可以包括多个语种,每个语种中的每个字符对应至少一个语种编码。
例如,对于中文、日文和韩文等多个语种,每个语种均包括大量的字符,且各个字符与其他语种的字符区别较大,则可以根据每个语种所包括的大量字符组成字符集合,并建立该字符集合与所属语种之间的对应关系,得到如表1所示的码表,该码表中展示了中文、日文和韩文分别对应的编码cn、ja和ko,且每个编码对应有相应语种的各个字符。
表1(原文为图像:展示了中文、日文、韩文的字符集合分别与语种编码cn、ja、ko的对应关系)
但是,对于英文、德文和法文等多个语种,每个语种对应的字符数目较少,但是与其他语种的字符类似,则可以将英文、德文和法文等多个语种作为语系语种,并对各个语种的字符进行统一编码。例如,不同语种之间相同的字符均可以与一个语种编码相对应,得到如表2所示的码表,该码表展示了语系语种中俄文和拉丁文之间相同字符和不同字符的编码方式,如表2所示,俄文和拉丁文的字符“A”、“B”和“y”分别可以同时与语种编码中的“A”、“B”和“y”相对应,而俄文中的字符“Б”和“Я”可以单独与语种编码“Б”和“Я”相对应,类似的,拉丁文中的“R”则可以单独与语种编码“R”相对应。
俄文    拉丁文    语种编码
А       А         A
Б       —         Б
В       B         B
У       y         y
Я       —         Я
—       R         R
表2
在S501获取历史样本集合之后,在S502中可以根据预先设置的码表,对历史样本集合中每个样本文本行图像对应的文本标识进行转换,从而可以得到样本文本中每个字符对应的语种编码,进而生成由样本文本行图像和对应的语种编码所组成的样本集合。
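作为示例而非限定,S502中"根据预设码表将文本标识逐字符转换为语种编码"的过程可示意如下(此处码表仅为按表1、表2思路假设的极小样本,实际码表覆盖各语种的全部字符):

```python
# 本示例假设的极小码表:中日韩字符映射到语种编码cn/ja/ko,
# 语系语种内相同字符共用一个编码,独有字符单独编码。
CODE_TABLE = {
    "中": "cn", "文": "cn",   # 示意表1:中文字符对应编码cn
    "あ": "ja",               # 示意表1:日文字符对应编码ja
    "Б": "Б", "Я": "Я",       # 示意表2:俄文独有字符单独编码
    "R": "R",                 # 示意表2:拉丁文独有字符单独编码
}

def encode_text(text):
    """返回样本文本中每个字符对应的语种编码列表。"""
    return [CODE_TABLE[ch] for ch in text]

print(encode_text("中文"))  # ['cn', 'cn']
```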
S503、将样本集合中的样本文本行图像输入初始语种识别模型的初始语种分类网络,得到样本文本行图像中样本文本的样本特征信息。
其中,样本特征信息用于指示样本文本的语种。
S503的过程与S402的过程类似,在此不再赘述。
S504、若样本文本的语种不是语系语种,根据预设的第一损失函数计算样本文本的语种和样本文本的实际语种之间的第一损失值。
若识别得到的样本文本的语种不是语系语种,则说明识别得到的语种仅包括一种语种,则可以将样本文本的语种确定为识别得到的语种,无需再根据初始卷积网络对样本文本的语种进行进一步确认。
相应的,可以根据确定的样本文本的语种、以及与样本文本行图像对应的语种编码所指示的实际语种,并结合预先设置的第一损失函数进行计算,得到两个语种之间的第一损失值,以便在后续步骤中,可以根据该第一损失值确定是否需要再次对初始语种识别模型进行训练。也即是,在计算得到第一损失值之后,可以执行S507确定是否需要继续对初始语种识别模型进行训练。
进一步地,在实际应用中,可以根据连续时序分类损失函数(Connectionist Temporal Classification Loss,CTCLoss)计算样本文本的语种和样本文本的实际语种之间的第一损失值。
但是,需要说明的是,若识别得到的样本文本的语种为语系语种,则可以执行S505,以通过初始卷积网络对样本文本的语种进行进一步识别。
S505、若样本文本的语种是语系语种,将样本特征信息输入初始语种识别模型的初始卷积网络,从语系语种中选取与样本文本的语言规律相匹配的语种,作为样本文本的语种。
S505的过程与S403的过程类似,在此不再赘述。
S506、根据预设的第二损失函数计算样本文本的语种和样本文本的实际语种之间的第二损失值。
S506的过程与S504的过程类似,在此不再赘述。
需要说明的是,第二损失函数可以为归一化指数损失函数(SoftMaxLoss),则可以根据归一化指数损失函数计算样本文本的语种和样本文本的实际语种之间的第二损失值。
S507、当第一损失值或第二损失值不满足预设条件时,调整初始语种识别模型的模型参数,并返回执行将样本集合中的样本文本行图像输入初始语种识别模型的初始语种分类网络,得到样本文本行图像中样本文本的样本特征信息的步骤以及后续步骤。
在计算得到第一损失值或第二损失值之后,可以判断第一损失值或第二损失值是否满足预先设置的预设条件,若不满足预设条件,则说明初始语种识别模型并未收敛,需要再次对初始语种识别模型进行训练,直至满足预设条件。
在一种可能的实现方式中,可以将第一损失值或第二损失值,与预先设定的与第一损失函数或第二损失函数相匹配的损失阈值进行比较,判断第一损失值或第二损失值是否小于或等于相对应的损失阈值。
若第一损失值或第二损失值大于相对应的损失阈值时,说明第一损失值或第二损失值不满足预设条件,则可以根据第一损失值或第二损失值对初始语种识别模型的参数进行调整,并再次执行S503、S504和S507,或者再次执行S503和S505至S507,也即是,将样本文本行图像输入调整模型参数后的初始语种识别模型中,从而根据计算再次得到的第一损失值或第二损失值对初始语种识别模型进行调整训练,直至第一损失值和第二损失值均满足预设条件。
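作为示例而非限定,S507与S508中"将损失值与阈值比较以决定是否继续训练"的判断逻辑可示意如下(阈值取值为本示例假设,实际中与第一、第二损失函数相匹配地预先设定):

```python
def should_stop(loss1, loss2, thr1=0.1, thr2=0.1):
    """当第一损失值和第二损失值均小于或等于各自阈值时,
    认为满足预设条件,可停止训练初始语种识别模型。"""
    return loss1 <= thr1 and loss2 <= thr2

print(should_stop(0.05, 0.08))  # True:停止训练,得到语种识别模型
print(should_stop(0.05, 0.30))  # False:调整模型参数,继续训练
```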
S508、当第一损失值和第二损失值均满足预设条件时,停止训练初始语种识别模型,并将第一损失值和第二损失值均满足该预设条件时的初始语种识别模型作为语种识别模型。
若第一损失值和第二损失值均满足预设条件,则说明初始语种识别模型开始收敛,可以停止对初始语种识别模型进行训练,并将当前第一损失值和第二损失值均满足预设条件的初始语种识别模型作为语种识别模型。
在一种可能的实现方式中,若计算得到第一损失值,且该第一损失值满足预设条件,则可以再次执行S503和S505至S507,若此次计算得到的第二损失值也满足预设条件,则可以停止对初始语种识别模型的训练,并将当前的初始语种识别模型作为语种识别模型。
需要说明的是,在实际应用中,在确定第一损失值满足预设条件后,再次训练确定第二损失值是否满足预设条件的过程中,可能会再次得到第一损失值,而在确定第二损失值满足预设条件之前计算得到的每个第一损失值均满足预设条件,则可以确定第一损失值和第二损失值均满足预设条件。
但是,若在确定第二损失值满足预设条件之前计算得到的任一第一损失值不满足预设条件,则不能确定第一损失值和第二损失值均满足预设条件。
类似的,若计算得到的第二损失值先满足预设条件,则可以再执行S503、S504和S507,确定再次计算的第一损失值是否满足预设条件,若满足预设条件,则可以停止对初始语种识别模型的训练,并将当前的初始语种识别模型作为语种识别模型。
而且,在确定第一损失值满足预设条件之前,若每次得到的第二损失值均满足预设条件,则可以确定第一损失值和第二损失值均满足预设条件。但是,若在确定第一损失值满足预设条件之前,计算得到的任一第二损失值不满足预设条件,则不能确定第一损失值和第二损失值均满足预设条件。
综上所述,本申请实施例提供的训练语种识别模型的方法,通过获取历史样本集合,并根据预先设置的码表对历史样本集合中的文本标识进行转换,得到样本集合,再通过样本集合对初始语种识别模型进行训练,得到第一损失值和第二损失值均满足预设条件的语种识别模型,通过语种识别模型对待识别文本行图像中的待识别文本进行识别,可以基于语言规律识别语种,避免了相同或相似字符的二义性导致识别不准确的问题,提高了识别语种的准确度。
进一步地,通过各个语种的字符之间的相似度设置码表,可以避免相同或相似字符的二义性问题,并减少语种识别模型最后一层神经网络的参数量,可以得到占用存储空间更小、识别速度更快的语种识别模型。
进一步地,采用历史样本集合获取训练初始语种识别模型的样本集合,无需生成样本数据,减少了训练初始语种识别模型的成本。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
对应于上文实施例所述的语种识别方法,图6是本申请实施例提供的一种语种识别装置的结构框图,为了便于说明,仅示出了与本申请实施例相关的部分。
参照图6,该装置包括:
图像获取模块601,用于获取待识别文本行图像,该待识别文本行图像包括待识别文本;
识别模块602,用于将该待识别文本行图像输入训练后的语种识别模型,得到该待识别文本的语种,该语种识别模型用于根据该待识别文本行图像,确定该待识别文本的语种,若该语种为包括多种语种的语系语种,则根据该待识别文本的语言规律,将该语系语种中与该语言规律对应的语种作为该待识别文本的语种。
可选的,该语种识别模型包括语种分类网络和卷积网络;
该识别模块602,还用于将该待识别文本行图像输入该语种分类网络,得到该待识别文本的特征信息,该特征信息用于指示该待识别文本的语种;若该语种为包括多种语种的语系语种,将该特征信息输入该卷积网络,确定该待识别文本的语言规律,并从该语系语种中选取与该语言规律相匹配的语种作为该待识别文本的语种。
可选的,参见图7,该装置还包括:
第一训练模块603,用于将样本集合中的样本文本行图像输入初始语种识别模型的初始语种分类网络,得到该样本文本行图像中样本文本的样本特征信息,该样本特征信息用于指示该样本文本的语种;
第一计算模块604,用于若该样本文本的语种不是语系语种,根据预设的第一损失函数计算该样本文本的语种和该样本文本的实际语种之间的第一损失值;
第二训练模块605,用于若该样本文本的语种是语系语种,将该样本特征信息输入该初始语种识别模型的初始卷积网络,从该语系语种中选取与该样本文本的语言规律相匹配的语种,作为该样本文本的语种;
第二计算模块606,用于根据预设的第二损失函数计算该样本文本的语种和该样本文本的实际语种之间的第二损失值;
调整模块607,用于当该第一损失值或该第二损失值不满足预设条件时,调整该初始语种识别模型的模型参数,并返回执行将样本集合中的样本文本行图像输入初始语种识别模型的初始语种分类网络,得到该样本文本行图像中样本文本的样本特征信息的步骤以及后续步骤;
确定模块608,用于当该第一损失值和该第二损失值均满足该预设条件时,停止训练该初始语种识别模型,并将该第一损失值和该第二损失值均满足该预设条件时的初始语种识别模型作为该语种识别模型。
可选的,参见图8,该装置还包括:
样本获取模块609,用于获取历史样本集合,该历史样本集合包括样本文本行图像和与每个该样本文本行图像对应的文本标识;
样本生成模块610,用于根据预设码表将各个该文本标识转换为语种编码,得到由该样本文本行图像和与每个该样本文本行图像对应的语种编码所组成的样本集合,该码表包括多个语种,每个该语种中的每个字符对应至少一个语种编码。
可选的,该第一计算模块604,还用于根据连续时序分类损失函数计算该样本文本的语种和该样本文本的实际语种之间的第一损失值;
相应的,该第二计算模块606,还用于根据归一化指数损失函数计算该样本文本的语种和该样本文本的实际语种之间的第二损失值。
可选的,该语种识别模型的语种分类网络用于识别该待识别文本中每个该字符的语种,并将数量最多的语种作为该待识别文本的语种。
综上所述,本申请实施例提供的语种识别装置,通过获取包括待识别文本的待识别文本行图像,并将待识别文本行图像输入训练后的语种识别模型,通过该语种识别模型确定待识别文本的语种,若该语种为包括多种语种的语系语种,则语种识别模型可以再根据待识别文本的语言规律,将语系语种中与语言规律对应的语种作为待识别文本的语种。基于语言规律识别语种,避免了相同或相似字符的二义性导致识别不准确的问题,提高了语种识别的准确率。
本申请实施例还提供一种终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如图3至图5对应的实施例中任一项所述的语种识别方法。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如图3至图5对应的实施例中任一项所述的语种识别方法。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的系统实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质至少可以包括:能够将计算机程序代码携带到终端设备的任何实体或装置、记录介质、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质。例如U盘、移动硬盘、磁碟或者光盘等。在某些司法管辖区,根据立法和专利实践,计算机可读介质不可以是电载波信号和电信信号。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (10)

  1. 一种语种识别方法,其特征在于,包括:
    获取待识别文本行图像,所述待识别文本行图像包括待识别文本;
    将所述待识别文本行图像输入训练后的语种识别模型,得到所述待识别文本的语种,所述语种识别模型用于根据所述待识别文本行图像,确定所述待识别文本的语种,若所述语种为包括多种语种的语系语种,则根据所述待识别文本的语言规律,将所述语系语种中与所述语言规律对应的语种作为所述待识别文本的语种。
  2. 如权利要求1所述的方法,其特征在于,所述语种识别模型包括语种分类网络和卷积网络;
    所述将所述待识别文本行图像输入训练后的语种识别模型,得到所述待识别文本的语种,包括:
    将所述待识别文本行图像输入所述语种分类网络,得到所述待识别文本的特征信息,所述特征信息用于指示所述待识别文本的语种;
    若所述语种为包括多种语种的语系语种,将所述特征信息输入所述卷积网络,确定所述待识别文本的语言规律,并从所述语系语种中选取与所述语言规律相匹配的语种作为所述待识别文本的语种。
  3. 如权利要求1所述的方法,其特征在于,在所述将所述待识别文本行图像输入训练后的语种识别模型之前,所述方法还包括:
    将样本集合中的样本文本行图像输入初始语种识别模型的初始语种分类网络,得到所述样本文本行图像中样本文本的样本特征信息,所述样本特征信息用于指示所述样本文本的语种;
    若所述样本文本的语种不是语系语种,根据预设的第一损失函数计算所述样本文本的语种和所述样本文本的实际语种之间的第一损失值;
    若所述样本文本的语种是语系语种,将所述样本特征信息输入所述初始语种识别模型的初始卷积网络,从所述语系语种中选取与所述样本文本的语言规律相匹配的语种,作为所述样本文本的语种;
    根据预设的第二损失函数计算所述样本文本的语种和所述样本文本的实际语种之间的第二损失值;
    当所述第一损失值或所述第二损失值不满足预设条件时,调整所述初始语种识别模型的模型参数,并返回执行将样本集合中的样本文本行图像输入初始语种识别模型的初始语种分类网络,得到所述样本文本行图像中样本文本的样本特征信息的步骤以及后续步骤;
    当所述第一损失值和所述第二损失值均满足所述预设条件时,停止训练所述初始语种识别模型,并将所述第一损失值和所述第二损失值均满足所述预设条件时的初始语种识别模型作为所述语种识别模型。
  4. 如权利要求3所述的方法,其特征在于,在所述将样本集合中的样本文本行图像输入初始语种识别模型的初始语种分类网络之前,所述方法还包括:
    获取历史样本集合,所述历史样本集合包括所述样本文本行图像和与每个所述样本文本行图像对应的文本标识;
    根据预设码表将各个所述文本标识转换为语种编码,得到由所述样本文本行图像和与每个所述样本文本行图像对应的语种编码所组成的样本集合,所述码表包括多个语种,每个所述语种中的每个字符对应至少一个语种编码。
  5. 如权利要求3所述的方法,其特征在于,所述根据预设的第一损失函数计算所述样本文本的语种和所述样本文本的实际语种之间的第一损失值,包括:
    根据连续时序分类损失函数计算所述样本文本的语种和所述样本文本的实际语种之间的第一损失值;
    相应的,所述根据预设的第二损失函数计算所述样本文本的语种和所述样本文本的实际语种之间的第二损失值,包括:
    根据归一化指数损失函数计算所述样本文本的语种和所述样本文本的实际语种之间的第二损失值。
  6. 如权利要求1至5任一所述的方法,其特征在于,所述语种识别模型的语种分类网络用于识别所述待识别文本中每个所述字符的语种,并将数量最多的语种作为所述待识别文本的语种。
  7. 一种语种识别装置,其特征在于,包括:
    图像获取模块,用于获取待识别文本行图像,所述待识别文本行图像包括待识别文本;
    识别模块,用于将所述待识别文本行图像输入训练后的语种识别模型,得到所述待识别文本的语种,所述语种识别模型用于根据所述待识别文本行图像,确定所述待识别文本的语种,若所述语种为包括多种语种的语系语种,则根据所述待识别文本的语言规律,将所述语系语种中与所述语言规律对应的语种作为所述待识别文本的语种。
  8. 如权利要求7所述的装置,其特征在于,所述语种识别模型包括语种分类网络和卷积网络;
    所述识别模块,还用于将所述待识别文本行图像输入所述语种分类网络,得到所述待识别文本的特征信息,所述特征信息用于指示所述待识别文本的语种;若所述语种为包括多种语种的语系语种,将所述特征信息输入所述卷积网络,确定所述待识别文本的语言规律,并从所述语系语种中选取与所述语言规律相匹配的语种作为所述待识别文本的语种。
  9. 一种终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至6任一项所述的方法。
  10. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1至6任一项所述的方法。
PCT/CN2020/125591 2019-11-22 2020-10-30 语种识别方法、装置、终端设备及计算机可读存储介质 WO2021098490A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911158357.1 2019-11-22
CN201911158357.1A CN111027528B (zh) 2019-11-22 2019-11-22 语种识别方法、装置、终端设备及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2021098490A1 true WO2021098490A1 (zh) 2021-05-27

Family

ID=70203211

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125591 WO2021098490A1 (zh) 2019-11-22 2020-10-30 语种识别方法、装置、终端设备及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN111027528B (zh)
WO (1) WO2021098490A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470617A (zh) * 2021-06-28 2021-10-01 科大讯飞股份有限公司 语音识别方法以及电子设备、存储装置
CN113782000A (zh) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 一种基于多任务的语种识别方法

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027528B (zh) * 2019-11-22 2023-10-03 华为技术有限公司 语种识别方法、装置、终端设备及计算机可读存储介质
CN111539207B (zh) * 2020-04-29 2023-06-13 北京大米未来科技有限公司 文本识别方法、文本识别装置、存储介质和电子设备
CN111832657A (zh) * 2020-07-20 2020-10-27 上海眼控科技股份有限公司 文本识别方法、装置、计算机设备和存储介质
CN112329454A (zh) * 2020-11-03 2021-02-05 腾讯科技(深圳)有限公司 语种识别方法、装置、电子设备及可读存储介质
CN112528682A (zh) * 2020-12-23 2021-03-19 北京百度网讯科技有限公司 语种检测方法、装置、电子设备和存储介质
CN113822275A (zh) * 2021-09-27 2021-12-21 北京有竹居网络技术有限公司 一种图像语种识别方法及其相关设备
CN114462397B (zh) * 2022-01-20 2023-09-22 连连(杭州)信息技术有限公司 一种语种识别模型训练方法、语种识别方法、装置及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020198712A1 (en) * 2001-06-12 2002-12-26 Hewlett Packard Company Artificial language generation and evaluation
CN105760901A (zh) * 2016-01-27 2016-07-13 南开大学 一种多语种倾斜文档图像的自动语言判别方法
CN107957994A (zh) * 2017-10-30 2018-04-24 努比亚技术有限公司 一种翻译方法、终端及计算机可读存储介质
CN110070853A (zh) * 2019-04-29 2019-07-30 盐城工业职业技术学院 一种语音识别转化方法及系统
CN111027528A (zh) * 2019-11-22 2020-04-17 华为技术有限公司 语种识别方法、装置、终端设备及计算机可读存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890783B (zh) * 2011-07-20 2015-07-29 富士通株式会社 识别图像块中文字的方向的方法和装置
US10388270B2 (en) * 2014-11-05 2019-08-20 At&T Intellectual Property I, L.P. System and method for text normalization using atomic tokens
US10043231B2 (en) * 2015-06-30 2018-08-07 Oath Inc. Methods and systems for detecting and recognizing text from images
CN107256378A (zh) * 2017-04-24 2017-10-17 北京航空航天大学 语种识别方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020198712A1 (en) * 2001-06-12 2002-12-26 Hewlett Packard Company Artificial language generation and evaluation
CN105760901A (zh) * 2016-01-27 2016-07-13 南开大学 一种多语种倾斜文档图像的自动语言判别方法
CN107957994A (zh) * 2017-10-30 2018-04-24 努比亚技术有限公司 一种翻译方法、终端及计算机可读存储介质
CN110070853A (zh) * 2019-04-29 2019-07-30 盐城工业职业技术学院 一种语音识别转化方法及系统
CN111027528A (zh) * 2019-11-22 2020-04-17 华为技术有限公司 语种识别方法、装置、终端设备及计算机可读存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470617A (zh) * 2021-06-28 2021-10-01 科大讯飞股份有限公司 语音识别方法以及电子设备、存储装置
CN113470617B (zh) * 2021-06-28 2024-05-31 科大讯飞股份有限公司 语音识别方法以及电子设备、存储装置
CN113782000A (zh) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 一种基于多任务的语种识别方法
CN113782000B (zh) * 2021-09-29 2022-04-12 北京中科智加科技有限公司 一种基于多任务的语种识别方法

Also Published As

Publication number Publication date
CN111027528A (zh) 2020-04-17
CN111027528B (zh) 2023-10-03

Similar Documents

Publication Publication Date Title
WO2021098490A1 (zh) 语种识别方法、装置、终端设备及计算机可读存储介质
WO2021135601A1 (zh) 辅助拍照方法、装置、终端设备及存储介质
US10169639B2 (en) Method for fingerprint template update and terminal device
US11290447B2 (en) Face verification method and device
WO2018113512A1 (zh) 图像处理方法以及相关装置
WO2018177071A1 (zh) 车牌号码匹配方法及装置、字符信息匹配方法及装置
WO2019233216A1 (zh) 一种手势动作的识别方法、装置以及设备
US10599913B2 (en) Face model matrix training method and apparatus, and storage medium
CN110033769B (zh) 一种录入语音处理方法、终端及计算机可读存储介质
CN111612093A (zh) 一种视频分类方法、视频分类装置、电子设备及存储介质
WO2016015471A1 (zh) 一种预测用户离网的方法及装置
CN108920080A (zh) 指纹扫描方法、移动终端及存储介质
CN110162956B (zh) 确定关联账户的方法和装置
CN108039995A (zh) 消息发送控制方法、终端及计算机可读存储介质
CN108537021A (zh) 基于双面屏的指纹解锁方法、终端及计算机可读存储介质
CN111383198B (zh) 图像处理方法及相关产品
CN108919977A (zh) 信息输入方法、终端及计算机可读存储介质
CN111104967A (zh) 图像识别网络训练方法、图像识别方法、装置及终端设备
CN112329926A (zh) 智能机器人的质量改善方法及***
CN110083742B (zh) 一种视频查询方法和装置
CN108153477B (zh) 多点触摸操作方法、移动终端及计算机可读存储介质
CN110866114B (zh) 对象行为的识别方法、装置及终端设备
CN109275112A (zh) 短信处理方法、服务器及计算机可读存储介质
CN109740121B (zh) 一种移动终端的搜索方法、移动终端及存储介质
CN109918348B (zh) 应用浏览记录的清理方法、终端及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20889735

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20889735

Country of ref document: EP

Kind code of ref document: A1