WO2021237227A1 - Method and system for multi-language text recognition model with autonomous language classification


Info

Publication number
WO2021237227A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
input image
text
computer
data
Prior art date
Application number
PCT/US2021/040137
Other languages
French (fr)
Inventor
Kaiyu ZHANG
Yuan Lin
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/040137
Publication of WO2021237227A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/24 Character recognition characterised by the processing or recognition method
    • G06V 30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V 30/246 Division of the character sequences into groups prior to recognition; Selection of dictionaries using linguistic properties, e.g. specific for English or German language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names

Definitions

  • the present application generally relates to artificial intelligence, particularly to methods and systems for using deep learning techniques to perform multi-lingual optical character recognition (OCR) on images having textual content.
  • FIG. 1 depicts an example computing system, such as a mobile computing device, implementing an optical character recognition (OCR) application, in accordance with embodiments of the application.
  • FIG. 2 is an example data processing environment for implementing multi-language text recognition and autonomous language classification, in accordance with some embodiments.
  • FIG. 3 is a block diagram illustrating a data processing system for implementing the multi-language text recognition and autonomous language classification, in accordance with some embodiments.
  • FIG. 4 is another example data processing system for training and applying a neural network based multi-language text recognition model for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • FIG. 5 is an example architecture for the multi-language text recognition model for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • FIG. 6 illustrates a simplified process of extracting textual information from an input image for a multi-language text recognition model for OCR, in accordance with some embodiments.
  • FIG. 7 illustrates a simplified process of feature encoding using a bidirectional long short-term memory (BiLSTM) network in the multi-language text recognition model for OCR, in accordance with some embodiments.
  • FIG. 8 is an operational flow diagram illustrating an example of an OCR process for recognizing multiple languages of text and autonomous language classification, in accordance with some embodiments.
  • FIG. 9 is a block diagram of an example computing component or device for implementing the disclosed multi-language text recognition and autonomous language classification techniques, in accordance with the disclosure.
  • Embodiments of the application provide a distinct method and system for multi-language scene text recognition.
  • the disclosed system and techniques improve automated text recognition applications, such as optical character recognition (OCR), by autonomously recognizing a language of origin for the text (also referred to herein as autonomous language classification) in addition to recognizing the characters in the text.
  • the disclosed autonomous language classification utilizes a multi-language text recognition model.
  • the multi-language text recognition model, as disclosed herein, applies deep learning algorithms in a manner that can accurately detect multiple languages using a single model. Therefore, the disclosed autonomous language classification techniques achieve an efficient, accurate, and seamless integration of autonomous language detection and character recognition for multiple languages in a single model.
  • OCR is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or from subtitle text superimposed on an image, making it an important technique for extracting information from images.
  • OCR is widely used in industry with respect to searching, positioning, translation, recommendation, and the like, and thus has a wide range of real-world applications and great commercial value.
  • a well-known use case for OCR is converting printed paper documents into machine-readable text documents. After a scanned paper document goes through OCR processing, the text of the document can be extracted and entered into a word processing application in a manner that allows for easy editing, transmission, etc.
  • Referring to FIG. 1, an example of a computing system 100 implementing an OCR application 150, including the disclosed scene text recognition (STR) (also referred to herein as text recognition) and autonomous language classification features, is shown.
  • the depicted computing system 100 can be a handheld mobile user device, namely a smartphone (e.g., application telephone), that includes a touchscreen display device 130 for presenting content to a user of the computing system 100 and receiving touch-based user inputs.
  • an example visual output mechanism, in the form of the display device 130, may be a display with resistive or capacitive touch capabilities.
  • the display device 130 may be for displaying video, graphics, images, and text, and for coordinating user touch input locations with the location of displayed information so that the device 130 can associate user contact at a location of a displayed item with the item.
  • the computing device 100 may also take alternative forms, including as a laptop computer, a tablet or slate computer, a personal digital assistant (PDA), an embedded system (e.g., a car navigation system), a desktop personal computer, or a computerized workstation.
  • An operating system may provide an interface between the hardware of the computing system 100 (e.g., the input/output mechanisms and a processor executing instructions retrieved from computer-readable medium), such as processors 160, and software, such as the OCR application 150.
  • Example operating systems include ANDROID, CHROME, IOS, MAC OS X, WINDOWS 7, WINDOWS PHONE 7, SYMBIAN, BLACKBERRY, WEBOS, a variety of UNIX operating systems, or a proprietary operating system for computerized devices.
  • the operating system may provide a platform for the execution of application programs, such as the OCR application 150, that facilitates interaction between the computing device 100 and a user.
  • the computing system 100 may present a graphical user interface (GUI) on the display device 130 (e.g., touchscreen).
  • the GUI can display results of the scene text recognition and autonomous language classification features generated by the OCR application 150.
  • a GUI is a collection of one or more graphical interface elements and may be static (e.g., the display appears to remain the same over a period of time), or may be dynamic (e.g., the graphical user interface includes graphical interface elements that animate without user input).
  • the GUI can render an image 120 including some textual content.
  • an image can be obtained from an image capturing device of the computing system 100, such as an embedded digital camera.
  • a user can employ the computing system 100 to capture an image (e.g., digital photograph) having text or can download an image (e.g., of a software application) having text that can be recognized by the OCR application 150 in a manner that is visible and interactive for the user.
  • the OCR application 150 is distinctly designed to automatically perform STR and multi-language detection.
  • FIG. 1 illustrates an example of results of the STR and multi-language detection, as output to the user on the display device 130.
  • image 120 is illustrated as containing text related to the downloadable application "Dance App".
  • the text is in a particular language, namely English, and describes various details of the "Dance App" that may be informative to users.
  • the text displayed within image 120 includes information associated with the "Dance App", such as the recommended ages (or age restriction) for users of the application, the user rating of the application, the current version of the application, and the like.
  • the OCR application 150 can automatically recognize the text that is in the image 120 (distinguishing the text content from the background and/or non-textual content in the image 120) and automatically classify the language corresponding to the text.
  • a key feature of the disclosed embodiments includes the distinct multi-language text recognition model implemented by the OCR application 150, which utilizes neural network modeling in a manner that does not require that the language of the input be known prior to performing text recognition.
  • thus, even when the language of the text in the image is not known beforehand, the OCR application 150 can still be useful.
  • the OCR application 150 can recognize the text that is in the image 120 and can autonomously output the respective language for each of the lines of text in the image 120 to the user.
  • FIG. 1 shows that the text recognition features implemented by the OCR application 150 can involve displaying bounding boxes, such as bounding box 122, around areas of the detected text within the image 120.
  • as shown, the portion of image 120 that includes text 121 "Dance App - Make Your Day" is surrounded by bounding box 122.
  • the OCR application 150 can further analyze the text of the image 120 that is within the bounding boxes, for instance analyzing the text 121 that is within the bounding box 122. Greater details of the analysis implemented by the OCR application 150, for example utilizing the multi-language text recognition model, are discussed below in reference to FIG. 5 - FIG. 8. As a result of the analysis, the OCR application 150 can also classify the language corresponding to the detected text.
  • FIG. 1 shows an example of a displayed result 123 that may be generated by the OCR application 150 based on the STR and multi-language detection that it performs.
  • the OCR application 150 may generate a result of the text detection and analysis, such as the result 123, in a manner that is visible to the user.
  • the result 123 is English language text "Dance App - Make Your Day," which is a composite of the characters of the detected text 121, which is also in English. Accordingly, a user can view a displayed result, such as result 123, which shows them the text and language that is detected within the image 120.
  • FIG. 2 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network connected home devices (e.g., a camera).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102, and implement some data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data, and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time.
  • the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera remotely and in real time.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), or other suitable communication protocols.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video, image, audio, or textual data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models such as the multi-language text recognition model, are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
  • the client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequent to model training, the client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • the client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application).
  • the client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
  • FIG. 3 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other nonvolatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200, e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices;
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using the text recognition model(s) 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
    o Training data 238 for training one or more text recognition models 240;
    o Text recognition model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques; and
    o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the text recognition models 240 locally at the client device 104 or remotely at the server 102.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the text recognition model 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices and corresponds to a set of instructions for performing a function described above.
  • the above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 4 is another example data processing system 300 for training and applying a neural network based (NN-based) multi-language text recognition model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the multi-language text recognition model 240 and a data processing module 228 for processing the content data using the multi-language text recognition model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained multi-language text recognition model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the multi-language text recognition model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, as is the data pre-processing module 308 that is applied to process the training data 306 consistent with the type of the content data.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
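  • By way of illustration, the following is a minimal sketch of these two pre-processing paths, assuming OpenCV and NumPy; the ROI format, target size, and frame parameters are illustrative placeholders rather than values from the application.

```python
import cv2
import numpy as np

def preprocess_image(image_path, roi, target_size=(320, 32)):
    """Extract a region of interest from a training image and resize it to a
    predefined image size expected by the recognition model."""
    image = cv2.imread(image_path)             # H x W x 3 (BGR)
    x, y, w, h = roi                           # ROI as (left, top, width, height)
    cropped = image[y:y + h, x:x + w]          # crop the text region
    resized = cv2.resize(cropped, target_size) # predefined image size (W, H)
    return resized.astype(np.float32) / 255.0  # normalize to [0, 1]

def preprocess_audio(waveform, frame_len=512, hop=256):
    """Convert a 1-D training sequence to the frequency domain using a
    short-time Fourier transform (per-frame FFT magnitudes)."""
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    spectra = [np.abs(np.fft.rfft(frame)) for frame in frames]
    return np.stack(spectra)                   # (num_frames, frame_len // 2 + 1)
```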
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing multi-language text recognition model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the multi-language text recognition model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified multi-language text recognition model 240 is provided to the data processing module 228 to process the content data.
  • the loss control module 312 is configured to train the one or more neural networks by minimizing a Connectionist Temporal Classification (CTC) based loss.
  • the CTC algorithm can assign a probability for any output sequence Y given an input X.
  • the CTC algorithm is alignment-free —meaning the function does not require an alignment between the input and the output.
  • the multi-language text recognition model 240 can solve Y* = argmax_Y p(Y | X), i.e., it selects the most probable output sequence Y for a given input X.
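  • As a concrete illustration, below is a minimal sketch of a CTC-based training loss computed with TensorFlow; the tensor shapes and the blank index are assumptions for illustration, not values specified by the application.

```python
import tensorflow as tf

def ctc_training_loss(labels, logits, label_lengths, logit_lengths, blank_index=0):
    """Alignment-free CTC loss between the per-segment logits produced by the
    dense layer and the ground-truth character index sequences."""
    # labels:  (batch, max_label_len) int32 character indexes
    # logits:  (batch, time_steps, num_characters + 1) unnormalized scores
    loss = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_lengths,
        logit_length=logit_lengths,
        logits_time_major=False,
        blank_index=blank_index,
    )
    return tf.reduce_mean(loss)  # minimized by the model training engine
```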
  • the neural network model can determine a final result of the language prediction, or language classification.
  • the training data includes a first data item in one of the plurality of languages.
  • the one or more neural networks is trained by training the one or more neural networks to recognize the first data item in each of the plurality of languages.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is preprocessed to extract an ROI or cropped to a predefined image size, and an audio clip is preprocessed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained multi-language text recognition model 240 provided by the model training module 226 to process the pre- processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • FIG. 5 depicts an example of the text recognition model 240, also referred to herein as the multi-language text recognition model.
  • FIG. 5 illustrates an example architecture of the text recognition model 240, showing the various processing layers used to implement the model 240.
  • the text recognition model 240 includes: input 241 for receiving the image; convolution neural network (CNN) 242 for feature extraction; sequence layer 243 for feature encoding; and dense layer 244 for classification by matching the encoded feature to the index of characters.
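  • The following is a minimal sketch of this four-stage architecture (input 241, CNN 242, sequence layer 243, dense layer 244), assuming TensorFlow/Keras; the input size, filter counts, and vocabulary size are illustrative placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_text_recognition_model(img_height=32, img_width=320, num_characters=8000):
    # Input 241: a cropped text image with three channels.
    inputs = layers.Input(shape=(img_height, img_width, 3), name="input_image")

    # CNN 242: feature extraction; height is pooled away so the remaining
    # width axis becomes an ordered sequence of per-segment feature vectors.
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 1))(x)                 # shrink height, keep width
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((img_height // 4, 1))(x)   # collapse remaining height
    x = layers.Reshape((-1, 256))(x)                   # (time_steps, features)

    # Sequence layer 243: bidirectional LSTM feature encoding.
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

    # Dense layer 244: per-segment scores over the multi-language character
    # index, plus one extra class for the CTC blank.
    outputs = layers.Dense(num_characters + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="multi_language_text_recognition")
```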
  • the particular model that is employed plays a significant role in OCR.
  • the CNN model that serves as the framework of an OCR-based language recognition application will directly impact the efficiency and accuracy of the application.
  • the data processing model 240 is robust and flexible for any size input and multi-languages.
  • the data processing model 240 is distinctly structured to have an efficient model size and running speed for implementing on differing hardware platforms, while still achieving the required accuracy to provide useful language detection in many real-world scenarios and environments.
  • the sequence of the dense layer 244 of the data processing model 240 is modified.
  • the text recognition model 240 is also trained with all of the needed languages in order to support the multi-language aspects. Thus, the data processing model 240 can recognize the scene text in an image and detect its language as well.
  • the CNN 242 is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN 242 employs convolution operations and belongs to a class of deep neural networks (i.e., a feedforward neural network that only moves data forward from the input layer 241 through the dense layer 244).
  • the one or more layers of the CNN 242 are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer, and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN 242.
  • the pre-processed video or image data is abstracted by each layer of the CNN 242 to a respective feature map.
  • video and image data can be processed by the CNN 242 for video and image recognition, classification, analysis, imprinting, or synthesis.
  • the CNN 242 can be implemented using various types of neural network models as deemed necessary and/or appropriate. Examples of convolutional neural networks include DenseNet and ResNet50 for online models, and MobileNetV2, MobileNetV3, and GhostNet-light for offline models.
  • the type of model used to implement the CNN 242 can be a design choice that is based on several factors related to the operating environment, such as restrictions on the speed and memory of the desired hardware platform.
  • the text recognition model 240 receives text extracted from an image as input.
  • a text detection module can be employed to get the accurate location of the text by giving the coordinates of its four corner points.
  • the image can be cropped in order to automatically extract any text that may be present in the scene.
  • FIG. 6 illustrates an example of an image 600 that is cropped such that the text within cropping box 610 can be analyzed by the data processing model 240.
  • the image 600 includes a scene of signage in front of a wall.
  • the signage has text written in four different languages. Particularly, each line of text in the signage is written in a respective language.
  • Cropping box 610 is shown to have its text 611 extracted from the image 600.
  • the multi-language text recognition model 240 receives the portion of the image 600 that has been cropped by cropping box 610, which specifically includes the characters of the text 611.
  • non-textual areas (e.g., areas that correspond to graphical elements or blank space) are not used as inputs for text recognition.
  • An area in the input image 501 with textual information is determined, and the textual content is extracted from this area to be used as an input for the CNN layer 242.
  • the cropping boxes 605, 610 are drawn as rectangular bounding boxes around areas of the detected text.
  • the size of the cropped images is adaptive to the font size or paragraph size of the textual content or is user adjustable.
  • a user can manually select text areas (e.g., by drawing rectangular boxes on the original input image) and crop the selected areas to be used as inputs for the CNN layer 242.
  • each of the cropped images may be split into different frames (e.g., overlapping sub-images of a cropped image) to feed the CNN layer 242 (e.g., using the Keras TimeDistributed wrapper), as sketched below.
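  • A minimal sketch of that frame-splitting step, assuming TensorFlow/Keras and illustrative frame sizes; the shared per-frame encoder here is a simplified stand-in for the CNN layer 242.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def split_into_frames(cropped, frame_width=64, stride=32):
    """Split a cropped text image (H x W x C) into overlapping sub-images."""
    h, w, c = cropped.shape
    starts = range(0, max(w - frame_width, 0) + 1, stride)
    return np.stack([cropped[:, s:s + frame_width, :] for s in starts])

# A per-frame encoder applied identically to every frame via TimeDistributed.
frame_encoder = tf.keras.Sequential([
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),                  # one feature vector per frame
])
frames_in = layers.Input(shape=(None, 32, 64, 3))     # (num_frames, H, W, C)
frame_features = layers.TimeDistributed(frame_encoder)(frames_in)
frame_model = tf.keras.Model(frames_in, frame_features)
```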
  • the multi-language text recognition model 240 can receive the cropped images, with the text extracted as shown in FIG. 6, at the input layer 241.
  • the input image can include characters of text as well as complicated backgrounds.
  • segment-wise visual features are initially extracted by the CNN 242.
  • the multi-language text recognition model 240 consists of a convolution neural network (CNN) implemented at the CNN layer 242, and recurrent neural networks implemented at the sequence layer 243.
  • the convolutional layers automatically extract a feature sequence from each input image.
  • a recurrent network is built for predicting each frame of the feature sequence, output by the convolutional layers.
  • a CNN is a common type of neural network used in computer vision to recognize objects and patterns in images and uses filters (e.g., matrices with randomized number values) within convolutional layers (e.g., an input is transformed before being passed to a next layer). Specifically, a textual region of the input image is divided into a plurality of segments. The CNN 242 extracts a feature vector for each segment and arranges the feature vectors for the plurality of segments into an ordered feature sequence.
  • the sequence layer 243 can perform feature encoding.
  • the sequence layer 243 can map each sliced feature to a word or character using a bidirectional long short-term memory (BiLSTM) network.
  • FIG. 7 illustrates an example of feature encoding using BiLSTM that may be performed by the sequence layer 243.
  • FIG. 7 shows an input image 706, which is a grayscale cropped image, e.g., processed from the input image received by the multi-language text recognition model 240 (shown in FIG. 5).
  • the input image 706 includes English text "STATE.”
  • the CNN 242 (shown in FIG. 5) converts the input image 706 into a feature sequence 702 that represents an image descriptor of the input image 706. Specifically, the input image 706 is sliced or divided into a plurality of segments 710.
  • the CNN 242 extracts a feature vector 704 from each segment 710 in the input image 706 and arranges the feature vectors 704 of the plurality of segments 710 into an ordered feature sequence 702.
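  • A toy NumPy illustration of this slicing and arranging step; the per-segment descriptor used here (row-wise mean intensity) merely stands in for the learned CNN features.

```python
import numpy as np

def slice_into_feature_sequence(gray_image, num_segments):
    """Divide a grayscale cropped image (H x W) into equal-width vertical
    segments and compute a simple per-segment descriptor."""
    h, w = gray_image.shape
    seg_w = w // num_segments
    segments = [gray_image[:, i * seg_w:(i + 1) * seg_w] for i in range(num_segments)]
    # Stand-in for the learned feature vectors 704: row-wise mean intensities.
    return np.stack([segment.mean(axis=1) for segment in segments])  # (T, H)
```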
  • the feature sequence 702 is further processed by a recurrent neural network, namely the sequence layer 243 (shown in FIG. 5), to update the feature vectors 704, based on a spatial context of each segment 710 in the input image 706.
  • for each segment 710, the corresponding feature vector 704 includes an ordered sequence of vector elements corresponding to a plurality of characters in a dictionary.
  • Each vector element represents a probability value of the specific segment 710 representing a corresponding character in the dictionary.
  • the vector elements can be grouped into a sequence of feature subsets, and each feature subset corresponds to a respective one of the languages in the dictionary.
  • the number of characters can be the number of units in the dense layer.
  • LSTM is directional, using past contexts.
  • contexts from both directions are useful and complementary to each other. Therefore, referring to FIG. 5, the sequence layer 243 is configured to combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, the sequence layer 243 can stack multiple bidirectional LSTMs, resulting in a deep bidirectional LSTM.
  • the sequence layer 243 can use other forms of recurrent neural networks, such as the Gated Recurrent Unit (GRU), which is very similar to an LSTM and consists of update and reset gates.
  • the sequence layer 243 can use a two-direction GRU as a bi-directional GRU (BiGRU) module in order to perform feature encoding.
  • the deep structure achieved with a BiGRU allows higher levels of abstraction than a shallow structure, and thus can achieve significant performance improvements in the task of text recognition.
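  • A minimal sketch of these deep bidirectional encoders, assuming TensorFlow/Keras; unit counts and stacking depth are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def deep_bilstm_encoder(units=256, depth=2):
    """Stack multiple bidirectional LSTMs into a deep BiLSTM encoder that
    combines forward and backward context for each segment."""
    return tf.keras.Sequential(
        [layers.Bidirectional(layers.LSTM(units, return_sequences=True))
         for _ in range(depth)])

def deep_bigru_encoder(units=256, depth=2):
    """Alternative encoder built from gated recurrent units (update and reset
    gates) wrapped bidirectionally, i.e., a stacked BiGRU."""
    return tf.keras.Sequential(
        [layers.Bidirectional(layers.GRU(units, return_sequences=True))
         for _ in range(depth)])

# Example: encode a batch of feature sequences of shape (batch, T, feature_dim).
encoded = deep_bilstm_encoder()(tf.random.normal([4, 80, 256]))  # -> (4, 80, 512)
```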
  • the features that are encoded by the sequence layer 243 can be used to classify the language of the text captured in the input image.
  • the dense layer 244 is distinctly configured to recognize the text in the scene and to detect, or otherwise classify, the language of the recognized text.
  • the dense layer 244 acts as a bridge connecting the features output by the neural network to the exact characters in order to determine the specific index in the dictionary of each feature.
  • the multi-language text recognition model 240 can be trained as an English text recognition model.
  • the dense layer 244 can output a special index sequence representing the features.
  • a special index sequence generated by the dense layer can be [14, 32, 29, ...].
  • the index sequence can then be translated into a meaningful sentence, such as 'How are you?'.
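  • A minimal sketch of that translation step, assuming a greedy, CTC-style decode (collapse repeated indexes, drop the blank) and a hypothetical index-to-character dictionary.

```python
def decode_index_sequence(indexes, index_to_char, blank_index=0):
    """Translate a special index sequence into a readable string by collapsing
    repeated indexes, dropping the blank, and looking up each remaining index
    in the character dictionary."""
    chars, prev = [], None
    for idx in indexes:
        if idx != blank_index and idx != prev:
            chars.append(index_to_char[idx])
        prev = idx
    return "".join(chars)

# Illustrative usage with a made-up dictionary fragment.
index_to_char = {14: "H", 32: "o", 29: "w"}
print(decode_index_sequence([14, 14, 0, 32, 29], index_to_char))  # -> "How"
```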
  • the language classification performed by the dense layer 244 can generally be described as deciphering the index sequence resulting from the encoded features of the image.
  • an index interval of 0 to 50 can belong to a first language, such as English
  • an index interval of 51-4500 can belong to a second language, for instance, Chinese
  • another index interval of 4501-7500 can belong to a third language, such as Japanese
  • the remaining index interval 7501-10,000 can belong to a fourth language, for instance Korean.
  • a classification for the language can ultimately be determined from the index sequence. For instance, referring back to the example index sequence [14, 32, 29, ...], the dense layer 244 can ascertain that the language is English, since all indexes are in the interval between 0 and 50 corresponding to English.
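  • A minimal sketch of this interval-based deciphering, using the example intervals above (0-50 English, 51-4500 Chinese, 4501-7500 Japanese, 7501-10,000 Korean).

```python
# Index intervals of the shared dictionary, mirroring the example above.
LANGUAGE_INTERVALS = {
    "English":  (0, 50),
    "Chinese":  (51, 4500),
    "Japanese": (4501, 7500),
    "Korean":   (7501, 10000),
}

def language_of_index(index):
    """Return the language whose dictionary interval contains this index."""
    for language, (low, high) in LANGUAGE_INTERVALS.items():
        if low <= index <= high:
            return language
    raise ValueError(f"index {index} falls outside every language interval")

def classify_language(index_sequence):
    """Classify the language when every index falls within a single interval."""
    languages = {language_of_index(i) for i in index_sequence}
    return languages.pop() if len(languages) == 1 else None  # ambiguous -> None

print(classify_language([14, 32, 29]))  # all indexes in 0-50 -> "English"
```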
  • a language classification can ultimately be selected using an additional voting feature.
  • the dense layer 244 can be configured to determine a probability for each language indicated by an index sequence, where the probability represents the likelihood, or a vote, that the index sequence is in that particular language.
  • an index sequence can have the majority of its indexes falling within the interval corresponding to English, and a smaller number of indexes falling within the interval corresponding to Chinese.
  • the voting feature may generate a higher percentage, or more votes, for English for the index sequence; and a lower percentage, or fewer votes, for Chinese for the index sequence.
  • the dense layer 244 may select English as the language classification for that index sequence (as opposed to Chinese), as it is the language with the highest number of votes.
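  • A minimal sketch of the voting feature, reusing the language_of_index lookup from the previous sketch; each index casts a vote for the language owning its interval, and the language with the highest vote share is selected.

```python
from collections import Counter

# Assumes language_of_index() from the previous sketch.
def classify_language_by_vote(index_sequence):
    """Each index casts a vote for the language whose interval contains it;
    the language with the most votes (highest share) is selected."""
    votes = Counter(language_of_index(i) for i in index_sequence)
    language, count = votes.most_common(1)[0]
    return language, count / len(index_sequence)   # winner and its vote share

# Mostly English-range indexes plus two Chinese-range indexes:
print(classify_language_by_vote([14, 32, 29, 45, 7, 120, 980]))  # ("English", ~0.71)
```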
  • FIG. 8 is a flowchart illustrating an exemplary OCR process 800 that can implement the autonomous language classification (for multiple languages) and text recognition features, as disclosed herein.
  • FIG. 8 shows process 800 as a series of executable operations stored on a machine-readable storage medium 806 performed by hardware processors 804, which can be the main processor of a computing component 802.
  • the computing component 802 can be a computing device implementing the OCR application described at least in reference to FIG. 1A.
  • hardware processors 804 execute the operations of process 800, thereby implementing the disclosed autonomous language classification (for multiple languages) and text recognition techniques.
  • the process 800 begins by extracting features from an image having pictorial representation of textual content in one or more languages.
  • the image is a three-channel image that includes both textual content and non-textual content.
  • Operation 805 can involve employing a neural network, such as a CNN, to extract segment-wise visual features corresponding to the textual content of the image.
  • the CNN can recognize objects and patterns in images, and a textual region of the input image can be divided into a plurality of segments by the CNN. Then, the CNN can extract a feature vector for each segment and arrange the feature vectors for the plurality of segments into an ordered feature sequence.
  • the process 800 encodes the features corresponding to the textual content of the image.
  • feature vectors output from the CNN in the previous operation 805 are then fed to a recurrent neural network for feature labeling.
  • a recurrent neural network is designed to interpret spatial context information by receiving the feature vectors corresponding to the segments of the input image and reusing activations of preceding or following segments in the input image to determine a textual recognition output for each segment in the input image.
  • encoding the features involves mapping the sliced feature vector containing textual information using BiLSTM.
  • the recurrent neural network can predict textual content corresponding to the feature sequence of the input image based on at least the spatial context of each segment in the input image.
  • the process 800 can continue to operation 815.
  • the encoded features can be matched to an index of characters for multiple languages.
  • the output vector from a dense layer of the multi-language text recognition model may be a vector of length N.
  • Each index in the output vector corresponds to a respective key in a hash map, wherein the corresponding value represents a matching character in a specific language.
  • Each value of the output vector from the dense layer represents a probability that the recognized character is the corresponding value in the hash map.
  • an output vector from the dense layer may be a vector of 8000 elements, defined as:
  • V_output = [a_1, a_2, a_3, ..., a_8000] (eq. 6)
  • a corresponding dictionary is a hash map that maps each index to one of the values v_1 through v_8000.
  • Each of the values in the dictionary corresponds to a character in a particular language. For example:
  • v_1 - v_4000 may correspond to 4000 different characters in Chinese
  • v_4001 - v_6000 may correspond to 2000 different characters in Japanese
  • v_6001 - v_7000 may correspond to 1000 different characters in Korean
  • v_7001 - v_8000 may correspond to 1000 different characters in Latin.
  • the output vector from the dense layer corresponds to a segment in the input image.
  • Each element in the output vector corresponds to a character of a respective language
  • a value of the respective element indicates a probability of textual content of the segment in the input image corresponding to the character of the respective language. Consequently, by ultimately matching each segment of the textual content (vis-a-vis the encoded features) to a character in a specific language, the text for multiple languages within the input image can be automatically recognized.
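  • A minimal sketch of this matching step, using the example value ranges above; the dictionary construction and character contents are illustrative placeholders.

```python
import numpy as np

# Value ranges of the dictionary, mirroring the example above.
DICTIONARY_RANGES = {
    "Chinese":  (1, 4000),
    "Japanese": (4001, 6000),
    "Korean":   (6001, 7000),
    "Latin":    (7001, 8000),
}

def build_dictionary(chars_per_language):
    """Build the hash map index -> (character, language) by laying each
    language's characters into its assigned value range."""
    dictionary = {}
    for language, chars in chars_per_language.items():
        start, end = DICTIONARY_RANGES[language]
        assert len(chars) <= end - start + 1
        for offset, ch in enumerate(chars):
            dictionary[start + offset] = (ch, language)
    return dictionary

def match_segments(output_vectors, dictionary):
    """For each segment's N-element probability vector, take the most probable
    index and look up the matching character and its language."""
    matches = []
    for vector in output_vectors:                # vector has shape (8000,)
        index = int(np.argmax(vector)) + 1       # dictionary keys start at 1
        matches.append(dictionary.get(index, ("?", "unknown")))
    return matches
```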
  • the language for the textual content of the input image can be autonomously classified.
  • the dictionary defines a range of values in the hash map to correspond to the characters in a particular language.
  • this known correlation between value ranges (set in the dictionary) and known languages can be leveraged to classify a language for the recognized text. That is, the range of values of the dictionary that each element in the output vector falls in can be analyzed to determine a language classification for the text. For example, if each element of the output vector falls within the v_1 - v_4000 range, that indicates that each segment of the text content in the image corresponds to a Chinese character. Accordingly, the recognized text may be classified as Chinese in operation 820. Therefore, the method 800 can achieve both STR and language detection in a manner that is fully autonomous (e.g., frees the user from having to select a specific language model), while realizing increased performance in precision and/or recall metrics.
  • FIG. 9 depicts a block diagram of an example computer system 900 in which various of the multi-language text recognition and autonomous language classification features described herein may be implemented.
  • the computer system 900 includes a bus 902 or other communication mechanism for communicating information, and one or more hardware processors 904 coupled with bus 902 for processing information.
  • Hardware processor(s) 904 may be, for example, one or more general purpose microprocessors.
  • the computer system 900 also includes a main memory 906, such as a random-access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 902 for storing information and instructions to be executed by processor 904.
  • Main memory 906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 904.
  • Such instructions when stored in storage media accessible to processor 904, render computer system 900 into a special-purpose machine that is customized to perform the operations specified in the instructions.
  • the computer system 900 further includes a read only memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904.
  • a storage device 910 such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 902 for storing information and instructions.
  • the computer system 900 may be coupled via bus 902 to a display 912, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user.
  • An input device 914 is coupled to bus 902 for communicating information and command selections to processor 904.
  • another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912.
  • the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.
  • the computing system 900 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s).
  • This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
  • the word “component,” “engine,” “system,” “database,” “data store,” and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++.
  • a software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts.
  • Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution).
  • Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device.
  • Software instructions may be embedded in firmware, such as an EPROM.
  • hardware components may be comprised of connected logic units, such as gates and flip- flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.
  • the computer system 900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor(s) 904 executing one or more sequences of one or more instructions contained in main memory 906. Such instructions may be read into main memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in main memory 906 causes processor(s) 904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • non-transitory media refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage device 910.
  • Volatile media includes dynamic memory, such as main memory 906.
  • non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.
  • Non-transitory media is distinct from but may be used in conjunction with transmission media.
  • Transmission media participates in transferring information between non-transitory media.
  • transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 902.
  • transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • the computer system 900 also includes a communication interface 918 coupled to bus 902.
  • Communication interface 918 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks.
  • communication interface 918 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line.
  • communication interface 918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN).
  • Wireless links may also be implemented.
  • communication interface 918 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • a network link typically provides data communication through one or more networks to other data devices.
  • a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP).
  • the ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet.”
  • Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams.
  • the signals through the various networks and the signals on network link and through communication interface 918, which carry the digital data to and from computer system 900, are example forms of transmission media.
  • the computer system 900 can send messages and receive data, including program code, through the network(s), network link and communication interface 918.
  • a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 918.
  • the received code may be executed by processor 904 as it is received, and/or stored in storage device 910, or other non-volatile storage for later execution.
  • Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code components executed by one or more computer systems or computer processors comprising computer hardware.
  • the one or more computer systems or computer processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service” (SaaS).
  • the processes and algorithms may be implemented partially or wholly in application-specific circuitry.
  • the various features and processes described above may be used independently of one another or may be combined in various ways. Different combinations and sub-combinations are intended to fall within the scope of this disclosure, and certain method or process blocks may be omitted in some implementations.
  • a circuit might be implemented utilizing any form of hardware, software, or a combination thereof.
  • processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit.
  • the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality.
  • where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 900.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

Systems and methods are provided for implementing multi-language scene text recognition. Particularly, the system and method can improve automated text recognition applications by autonomously recognizing characters in text, and a language of origin for the text. Additionally, a multi-language text recognition model is employed, which applies deep learning algorithms to accurately detect multiple languages using the one model. Therefore, the system and method can achieve an efficient, accurate, and seamless integration of autonomous language detection and character recognition for multiple languages using a single model. A method can involve extracting visual features corresponding to textual content of an input image, where the input image comprises textual content and non-textual content. The extracted features can be encoded to map each visual feature with a character to recognize the textual content. Further, a language for the recognized text can be autonomously classified based on index values corresponding to the characters.

Description

METHOD AND SYSTEM FOR MULTI-LANGUAGE TEXT RECOGNITION MODEL WITH AUTONOMOUS LANGUAGE CLASSIFICATION
Description of Related Art
[0001] The present application generally relates to artificial intelligence, particularly to methods and systems for using deep learning techniques to perform multi-lingual optical character recognition (OCR) on images having textual content.
Brief Description of the Drawings
[0002] The present disclosure, in accordance with one or more various embodiments, is described in detail with reference to the following figures. The figures are provided for purposes of illustration only and merely depict typical or example embodiments.
[0003] FIG. 1 depicts an example computing system, such as a mobile computing device, implementing an optical character recognition (OCR) application, in accordance with embodiments of the application.
[0004] FIG. 2 is an example data processing environment for implementing multi-language text recognition and autonomous language classification, in accordance with some embodiments.
[0005] FIG. 3 is a block diagram illustrating a data processing system for implementing the multi-language text recognition and autonomous language classification, in accordance with some embodiments.
[0006] FIG. 4 is another example data processing system for training and applying a neural network based multi-language text recognition model for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
[0007] FIG. 5 is an example architecture for the multi-language text recognition model for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
[0008] FIG. 6 illustrates a simplified process of extracting textual information from an input image for a multi-language text recognition model for OCR, in accordance with some embodiments.
[0009] FIG. 7 illustrates a simplified process of feature encoding using bidirectional long short-term memory (BiLSTM) in the multi-language text recognition model for OCR, in accordance with some embodiments.
[0010] FIG. 8 is an operational flow diagram illustrating an example of an OCR process for recognizing multiple languages of text and autonomous language classification, in accordance with some embodiments.
[0011] FIG. 9 is a block diagram of an example computing component or device for implementing the disclosed multi-language text recognition and autonomous language classification techniques, in accordance with the disclosure.
[0012] The figures are not exhaustive and do not limit the present disclosure to the precise form disclosed.
Detailed Description
[0013] Embodiments of the application provide a distinct method and system for multi-language scene text recognition. Particularly, the disclosed system and techniques improve automated text recognition applications, such as Optical Character Recognition (OCR), by autonomously recognizing a language of origin for the text (also referred to herein as autonomous language classification) in addition to recognizing the characters in the text. Additionally, the disclosed autonomous language classification utilizes a multi-language text recognition model. The multi-language text recognition model, as disclosed herein, applies deep learning algorithms in a manner that can accurately detect multiple languages using the one model. Therefore, the disclosed autonomous language classification techniques achieve an efficient, accurate, and seamless integration of autonomous language detection and character recognition for multiple languages in a single model.
[0014] OCR, as referred to herein, is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo, or from subtitle text superimposed on an image, and is an important technique for extracting information from images. OCR is widely used in industry with respect to searching, positioning, translation, recommendation, and the like; thus, OCR has a wide range of real-world applications and great commercial value. For example, a well-known use case for OCR is converting printed paper documents into machine-readable text documents. After a scanned paper document goes through OCR processing, the text of the document can be extracted and entered into a word processing application in a manner that allows for easy editing, transmission, etc.
[0015] Since the accuracy of text recognition using OCR has improved greatly since the technique was first available, OCR continues to emerge as an important technique for extracting information from different types of images and for different use cases. Particularly, OCR technology is starting to become more relevant in the realm of language detection. Recently, deep learning algorithms have been used to implement language detection models in OCR. However, these deep learning algorithms are normally developed for individual languages. When multiple languages are involved, a deep learning algorithm can grow in size drastically and demand a large amount of computational resources and training data. In addition, many existing deep learning algorithms that are applied for language detection employ an isolated model for this task, which causes the resulting language detection approaches to have very low accuracy. Further, in many situations, training data is limited for some languages that are not frequently applied, rendering these currently used deep learning algorithms for language detection in those languages inaccurate.
[0016] Moreover, many of the currently existing OCR-based language detection applications (employing the aforementioned deep learning approaches) have two key drawbacks: (1) the input language is required to be known; and (2) one deep learning model can only be used for one language, as alluded to above. These limitations can negatively impact the overall performance of the current language detection approaches. For example, a person using OCR for a language translation may be traveling internationally for the first time and may not know the specific language of origin in the region. Thus, in these instances where the input language is not known, the existing language detection approaches (which require a known input language) cannot be used. Furthermore, for the existing OCR-based language detection approaches to support multiple languages, it will also be required to implement multiple deep learning models. That is, when a single model corresponds to a single language, as the number of languages that are supported increases, the number of models needed will proportionally increase. Consequently, even if a robust multi-language OCR application that supports a wide range of languages is implemented using the existing language detection approaches, this OCR application would inefficiently consume memory and/or computational resources to the point where it would not be optimal for many resource-limited devices, such as mobile computing devices and mobile phones (e.g., smartphones). In order to address the drawbacks of the existing OCR language technology, the disclosed embodiments realize an OCR text recognition model that has a reasonable size for increased efficiency, while having the capability to accurately recognize multiple languages.
[0017] Referring now to FIG. 1, an example of a computing system 100 implementing an OCR application 150, including the disclosed scene text recognition (STR) (also referred to herein as text recognition) and autonomous language classification features is shown. In this illustration, the depicted computing system 100 can be a handheld mobile user device, namely a smartphone (e.g., application telephone) that includes a touchscreen display device 130 for presenting content to a user of the computing system 100 and receiving touch-based user inputs. Other visual, tactile, and auditory output components may also be provided (e.g., LED lights, a vibrating mechanism for tactile output, or a speaker for providing tonal, voice-generated, or recorded output), as may various different input components (e.g., keyboard, physical buttons, trackballs, accelerometers, gyroscopes, and magnetometers).
[0018] Example visual output mechanism in the form of display device 130 may take the form of a display with resistive or capacitive touch capabilities. The display device 130 may be for displaying video, graphics, images, and text, and for coordinating user touch input locations with the location of displayed information so that the device 130 can associate user contact at a location of a displayed item with the item. The computing device 100 may also take alternative forms, including as a laptop computer, a tablet or slate computer, a personal digital assistant (PDA), an embedded system (e.g., a car navigation system), a desktop personal computer, or a computerized workstation.
[0019] An operating system may provide an interface between the hardware of the computing system 100 (e.g., the input/output mechanisms and a processor executing instructions retrieved from computer-readable medium), such as processors 160, and software, such as the OCR application 150. Example operating systems include ANDROID, CHROME, IOS, MAC OS X, WINDOWS 7, WINDOWS PHONE 7, SYMBIAN, BLACKBERRY, WEBOS, a variety of UNIX operating systems, or a proprietary operating system for computerized devices. The operating system may provide a platform for the execution of application programs, such as the OCR application 150, that facilitates interaction between the computing device 100 and a user.
[0020] The computing system 100 may present a graphical user interface (GUI) on the display device 130 (e.g., touchscreen). For example, as seen in FIG. 1, the GUI can display results of the scene text recognition and autonomous language classification features generated by the OCR application 150. As referred to herein, a GUI is a collection of one or more graphical interface elements and may be static (e.g., the display appears to remain the same over a period of time), or may be dynamic (e.g., the graphical user interface includes graphical interface elements that animate without user input). As illustrated in FIG. 1, the GUI can render an image 120 including some textual content. In the example of FIG. 1 the image 120 is associated with a downloadable application, illustrated as "Dance App." In another example that is also described herein, an image can be obtained from an image capturing device of the computing system 100, such as an embedded digital camera. In other words, a user can employ the computing system 100 to capture an image (e.g., digital photograph) having text or can download an image (e.g., software application) having text that can be recognized by the OCR application 150 in a manner that is visible and interactive for the user.
[0021] According to the embodiments, the OCR application 150 is distinctly designed to automatically perform STR and multi-language detection. FIG. 1 illustrates an example of results of the STR and multi-language detection, as output to the user on the display device 130. In FIG. 1, image 120 is illustrated as text that is related to the downloadable application "Dance App". The text is in a particular language, namely English, and describes various details of the "Dance App" that may be informative to the users. In the example, the text displayed within image 120 includes information associated with the "Dance App", such as the recommended ages (or age restriction) for users of the application, the user rating of the application, current version of the application, and the like. According to the embodiments, the OCR application 150 can automatically recognize the text that is in the image 120 (distinguishing the text content from the background and/or non-textual content in the image 120) and automatically classify the language corresponding to the text. As previously described, a key feature of the disclosed embodiments includes the distinct multi-language text recognition model implemented by the OCR application 150, which utilizes neural network modeling in a manner that does not require that the language of the input be known prior to performing text recognition. Thus, even if the user of the computing system does not know any of the languages for the text in the scene of image 120, the OCR application 150 can still be useful. According to the embodiments, the OCR application 150 can recognize the text that is in the image 120 and can autonomously output the respective language for each of the lines of text in the image 120 to the user.
[0022] Also, FIG. 1 shows that the text recognition features implemented by the OCR application 150 can involve displaying bounding boxes, such as bounding box 122, around areas of the detected text within the image 120. As an example, the portion of image 120 that includes text 121 "Dance App - Make Your Day" is surrounded by bounding box 122. The OCR application 150 can further analyze the text of the image 120 that is within the bounding boxes, for instance analyzing the text 121 that is within the bounding box 122. Greater details of the analysis implemented by the OCR application 150, for example utilizing the multi-language text recognition model, are discussed below in reference to FIG. 5 - FIG. 8. As a result of the analysis, the OCR application 150 can also classify the language corresponding to the detected text. FIG. 1 shows an example of a displayed result 123 that may be generated by the OCR application 150 based on the STR and multi-language detection that it performs. In other words, the OCR application 150 may generate a result of the text detection and analysis, such as the result 123, in a manner that is visible to the user. In the example of FIG. 1, there is a result that corresponds to each bounding box in image 120, and in particular result 123 corresponds to bounding box 122. As seen in FIG. 1, the result 123 is English language text "Dance App - Make Your Day," which is a composite of the characters of the detected text 121, which is also in English. Accordingly, a user can view a displayed result, such as result 123, which shows them the text and language that is detected within the image 120.
[0023] FIG. 2 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network connected home devices (e.g., a camera). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0024] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102, and implement some data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
[0025] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM
Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0026] Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video, image, audio, or textual data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models, such as the multi-language text recognition model, are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C). The client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequent to model training, the client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A). The server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application). The client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B. The server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
[0027] FIG. 3 is a block diagram illustrating a data processing system 200, in accordance with some embodiments. The data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof. The data processing system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
[0028] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other nonvolatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or nonweb based applications for controlling another electronic device and reviewing data captured by such devices);
• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
• Data processing module 228 for processing content data using the text recognition model(s) 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
• One or more databases 230 for storing at least data including one or more of:
o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
o Training data 238 for training one or more text recognition models 240;
o Text recognition model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques; and
o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the text recognition models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 to be presented on client device 104.
[0029] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the text recognition model 240 are stored at the server 102 and storage 106, respectively.
[0030] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or
otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0031] FIG. 4 is another example data processing system 300 for training and applying a neural network based (NN-based) multi-language text recognition model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the multi-language text recognition model 240 and a data processing module 228 for processing the content data using the multi-language text recognition model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained multi-language text recognition model 240 to the client device 104.
[0032] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The multi-language text recognition model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, as is the data pre-processing module 308 applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing multi-language text recognition model 240 and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the multi-language text recognition model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified multi-language text recognition model 240 is provided to the data processing module 228 to process the content data.
[0033] In some embodiments, the loss control module 312 is configured to train the one or more neural networks by minimizing a Connectionist Temporal Classification (CTC) based loss. In general, CTC is employed because this loss function can be applied in scenarios where the alignment between the input and the output is not known, which is often the case in the realm of language recognition and/or speech recognition. The CTC algorithm can assign a probability for any output Y given an input X. The CTC algorithm is alignment-free, meaning the function does not require an alignment between the input and the output. However, to get the probability of an output given an input, CTC can sum over the probability of all possible alignments between the input and the output. This probability computed by the CTC can be represented mathematically as:

p(Y|X) = Σ_{A ∈ A_{X,Y}} ∏_{t=1}^{T} p_t(a_t | X) (eq. 1)
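As an illustration only, the following is a minimal sketch of how a CTC-based loss of the form of eq. 1 might be computed for a batch of predictions, here using the Keras ctc_batch_cost helper. The tensor shapes, the 100-character dictionary, and the use of random data are assumptions for illustration and do not reflect the actual training implementation of the disclosed model.

```python
# Minimal sketch: computing a CTC loss for a batch of text-line predictions.
# The shapes, the 100-class toy dictionary, and ctc_batch_cost are illustrative
# assumptions, not the exact loss implementation described in this disclosure.
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

batch, time_steps, num_classes = 2, 32, 100     # hypothetical dictionary of 100 classes
max_label_len = 10

# y_pred: per-time-step softmax probabilities from the dense layer.
y_pred = tf.nn.softmax(tf.random.normal((batch, time_steps, num_classes)))
# y_true: integer character indexes (padded), with lengths given separately.
# Keras reserves the last index (num_classes - 1) as the CTC blank.
y_true = tf.constant(np.random.randint(1, num_classes - 1, (batch, max_label_len)),
                     dtype=tf.int32)
input_length = tf.fill((batch, 1), time_steps)
label_length = tf.fill((batch, 1), max_label_len)

# ctc_batch_cost sums over all alignments between prediction and label (eq. 1).
loss = K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
print(loss.shape)  # one loss value per sample in the batch
```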
[0034] Determining the probability of the alignments dictates how the loss function is ultimately calculated. Thus, the CTC alignment goes from probabilities at each time-step to the probability of an output sequence. After the neural network is trained using a loss function such as CTC, the neural network can be used to find a likely output for a given input. More precisely, the multi-language text recognition model 240 can solve:
Y* = argmax_Y p(Y|X) (eq. 2)
[0035] One heuristic is to select the most likely output at each time-step. This gives us the alignment with the highest probability:
A* = argmax_A ∏_{t=1}^{T} p_t(a_t | X) (eq. 3)
[0036] Accordingly, by finding the most likely (e.g., highest probability) character for each feature, the neural network model can determine a final result of the language prediction, or language classification.
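The following is a minimal, self-contained sketch of the best-path heuristic of eq. 3: take the most likely class at each time-step, collapse repeated classes, and drop the blank symbol. The blank index and the toy four-class dictionary are hypothetical and chosen only for illustration.

```python
# Minimal sketch of best-path (greedy) CTC decoding: take the most likely class at each
# time-step (eq. 3), collapse repeated classes, and drop the blank symbol.
# The blank index and the toy dictionary below are assumptions for illustration.
import numpy as np

def best_path_decode(probs, blank=0):
    """probs: (time_steps, num_classes) per-time-step probabilities."""
    best = probs.argmax(axis=1)               # most likely class per time-step
    decoded, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:      # collapse repeats, skip blanks
            decoded.append(int(idx))
        prev = idx
    return decoded

# Toy example: 5 time-steps over a 4-class dictionary {blank, 'H', 'I', '!'}.
toy_dict = {1: 'H', 2: 'I', 3: '!'}
probs = np.array([[0.1, 0.8, 0.05, 0.05],
                  [0.1, 0.8, 0.05, 0.05],     # repeated 'H' collapses to one character
                  [0.7, 0.1, 0.1, 0.1],       # blank separates characters
                  [0.1, 0.1, 0.7, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
indices = best_path_decode(probs)
print(''.join(toy_dict[i] for i in indices))  # "HI!"
```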
[0037] In some embodiments, the training data includes a first data item in one of the plurality of languages. The one or more neural networks are trained to recognize the first data item in each of the plurality of languages.
[0038] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0039] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained multi-language text recognition model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
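As a non-limiting illustration of the image pre-processing described above, the following sketch converts an input image to a predefined format (grayscale, fixed height, normalized values). The target height of 32 pixels and the helper name preprocess_text_image are assumptions; the actual predefined format depends on the deployed model.

```python
# Minimal sketch of image pre-processing to a predefined format before model input.
# The target height of 32 pixels and grayscale conversion are assumptions.
import cv2
import numpy as np

def preprocess_text_image(image_bgr, target_height=32):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)          # drop color channels
    scale = target_height / gray.shape[0]
    new_width = max(1, int(round(gray.shape[1] * scale)))
    resized = cv2.resize(gray, (new_width, target_height))      # keep aspect ratio
    return resized.astype(np.float32) / 255.0                   # normalize to [0, 1]

# Example usage with a synthetic "cropped text line" image.
dummy_crop = np.random.randint(0, 256, (48, 240, 3), dtype=np.uint8)
print(preprocess_text_image(dummy_crop).shape)  # (32, 160)
```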
[0040] FIG. 5 depicts an example of the text recognition model 240, also referred to herein as the multi-language text recognition model. Particularly, FIG. 5 illustrates an example architecture of the text recognition model 240, showing the various processing layers used to implement the model 240. In the example of FIG. 5, the text recognition model 240 includes: input 241 for receiving the image; convolution neural network (CNN) 242 for feature extraction; sequence layer 243 for feature encoding; and dense layer 244 for classification by matching the encoded feature to the index of characters. As alluded to above, the particular model that is employed plays a significant role in OCR. For example, the CNN model that serves as the framework of an OCR-based language recognition application will directly impact the efficiency and accuracy of the application. Accordingly, the data processing model 240 is robust and flexible for any size input and multiple languages. The data processing model 240 is distinctly structured to have an efficient model size and running speed for implementing on differing hardware platforms, while still achieving the required accuracy to provide useful language detection in many real-world scenarios and environments. Generally, the sequence of the dense layer 244 of the data processing model 240 is modified. The text recognition model 240 is also trained with all of the needed languages, in order to support the multi-language aspects. Thus, the data processing model 240 can recognize the scene text in an image, and detect the language as well.
[0041] In some embodiments, the CNN 242 is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN 242 employs convolution operations and belongs to a class of deep neural networks (i.e., a feedforward neural network that only moves data forward from the input layer 241 through the dense layer 244). The one or more layers of the CNN 242 are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer, and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN 242. The pre-processed video or image data is abstracted by each layer of the CNN 242 to a respective feature map. By these means, video and image data can be processed by the CNN 242 for video and image recognition, classification, analysis, imprinting, or synthesis. The CNN 242 can be implemented using various types of neural network models as deemed necessary and/or appropriate. Examples of convolutional neural networks include DenseNet and ResNet50 for online models, and MobileNetV2, MobileNetV3, and GhostNet-light for offline models. The type of model used to implement the CNN 242 can be a design choice that is based on several factors related to the operating environment, such as restrictions on the speed and memory of the desired hardware platform.
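The following is a minimal Keras sketch of the layered architecture of FIG. 5, using a small convolutional stack in place of a full backbone such as MobileNetV2, followed by a stacked bidirectional LSTM sequence layer and a dense softmax layer over a combined character dictionary. All layer sizes and the assumed dictionary size of 8000 characters (plus a CTC blank) are illustrative and are not the exact model disclosed herein.

```python
# Minimal Keras sketch of the FIG. 5 pipeline: a small convolutional stack (standing in for
# a backbone such as MobileNetV2), a stacked BiLSTM sequence layer, and a dense softmax
# layer over a combined multi-language character dictionary. All sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8000 + 1     # combined multi-language dictionary plus a CTC blank (assumed)

def build_crnn(input_height=32):
    inputs = layers.Input(shape=(input_height, None, 1))          # variable-width text line
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)                  # shrink height, keep width
    # Treat each remaining column of the feature map as one segment of the text line.
    x = layers.Permute((2, 1, 3))(x)                              # (width, height, channels)
    x = layers.TimeDistributed(layers.Flatten())(x)               # one feature vector per segment
    # Sequence layer: stacked bidirectional LSTMs encode context in both directions.
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    # Dense layer: per-segment probabilities over the combined character dictionary.
    outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)
    return models.Model(inputs, outputs)

build_crnn().summary()
```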
[0042] In operation, the text recognition model 240 receives text extracted from an image as input. For example, a text detection module can be employed to get the accurate location of the text, given the coordinates of four corner points. Subsequently, the image can be cropped in order to automatically extract any text that may be present in the scene. FIG. 6 illustrates an example of an image 600 that is cropped such that the text 610 can be analyzed by the data processing model 240.
[0043] As seen in FIG. 6, the image 600 includes a scene of signage in front of a wall. The signage has text written in four different languages. Particularly, each line of text in the signage is written in a respective language. In the example, there is a cropping box 605 surrounding the text 606 of the first line on the signage, and another cropping box 610 surrounding the text 611 of the second line on the signage. Cropping box 610 is shown to have its text 611 extracted from the image 600. Thus, the multi-language text recognition model 240 receives the portion of the image 600 that has been cropped by cropping box 610, which specifically includes the characters of the text 611. By cropping the input image 600, non-textual areas (e.g., areas that correspond to graphical elements or blank space) are removed. An area in the input image 501 with textual information is determined, and the text is extracted from this area to be used as an input for the CNN layer 242. As seen in FIG. 6, the cropping boxes 605, 610 are drawn as rectangular bounding boxes around areas of the detected text. The size of the cropped images is adaptive to the font size or paragraph size of the textual content or is user adjustable. In some embodiments, a user can manually select text areas (e.g., by drawing rectangular boxes on the original input image) and crop the selected areas to be used as inputs for the CNN layer 242. In another example, each of the cropped images may be split into different frames (e.g., overlapping sub-images of a cropped image) to feed the CNN layer 242 (e.g., using the Keras TimeDistributed wrapper).
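As a non-limiting illustration of cropping a detected text region given four corner points, the following OpenCV sketch warps the detected quadrilateral onto an axis-aligned rectangle. The corner ordering, the output height, and the synthetic detection box are assumptions for illustration only.

```python
# Minimal sketch: cropping a detected text region out of a scene image, given the four
# corner points reported by a text detector. The corner order (top-left, top-right,
# bottom-right, bottom-left) and the output size are assumptions for illustration.
import cv2
import numpy as np

def crop_text_region(image, corners, out_h=32):
    tl, tr, br, bl = corners.astype(np.float32)
    width = int(round(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl))))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, out_h - 1], [0, out_h - 1]], dtype=np.float32)
    # Map the (possibly skewed) quadrilateral onto an axis-aligned rectangle.
    matrix = cv2.getPerspectiveTransform(corners.astype(np.float32), dst)
    return cv2.warpPerspective(image, matrix, (width, out_h))

# Example usage on a synthetic scene image with a hypothetical detection box.
scene = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
box = np.array([[100, 200], [400, 190], [405, 230], [105, 240]])
print(crop_text_region(scene, box).shape)  # (32, width, 3)
```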
[0044] The multi-language text recognition model 240 can receive the cropped images, extracting the text as shown in FIG. 6, at the input layer 241. The input image can include characters of text, and some complicated backgrounds. Thus, segment-wise visual features from the CNN 242 are initially extracted. The multi-language text recognition model 240 consists of a convolution neural network (CNN) implemented at the CNN layer 242, and recurrent neural networks implemented at the sequence layer 243. The convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for predicting each frame of the feature sequence output by the convolutional layers. As referred to herein, a CNN is a common type of neural network used in computer vision to recognize objects and patterns in images and uses filters (e.g., matrices with randomized number values) within convolutional layers (e.g., an input is transformed before being passed to a next layer). Specifically, a textual region of the input image is divided into a plurality of segments. The CNN 242 extracts a feature vector for each segment and arranges the feature vectors for the plurality of segments into an ordered feature sequence.
[0045] After the features are extracted from the cropped image by the CNN 242, the sequence layer 243 can perform feature encoding. For feature encoding, the sequence layer 243 can map a sliced feature with each word or character using bidirectional long short-term memory (BiLSTM). FIG. 7 illustrates an example of feature encoding using BiLSTM that may be performed by the sequence layer 243.
[0046] FIG. 7 shows an input image 706 that is a grayscale cropped image, e.g., which is processed from the input image received by the multi-language text recognition model 240 (shown in FIG. 5). As seen, the input image 706 includes the English text "STATE." The CNN 242 (shown in FIG. 5) converts the input image 706 into a feature sequence 702 that represents an image descriptor of the input image 706. Specifically, the input image 706 is sliced or divided into a plurality of segments 710. The CNN 242 extracts a feature vector 704 from each segment 710 in the input image 706 and arranges the feature vectors 704 of the plurality of segments 710 into an ordered feature sequence 702. The feature sequence 702 is further processed by a recurrent neural network, namely the sequence layer 243 (shown in FIG. 5), to update the feature vectors 704 based on a spatial context of each segment 710 in the input image 706.
[0047] For a specific segment 710 in the input image 706, the corresponding feature vector 704 includes an ordered sequence of vector elements corresponding to a plurality of characters in a dictionary. Each vector element represents a probability value of the specific segment 710 representing a corresponding character in the dictionary. The vector elements can be grouped into a sequence of feature subsets, and each feature subset corresponds to a respective one of the languages in the dictionary. In some embodiments, the number of characters can be the number of units in the dense layer.
[0048] Typically, LSTM is directional, using past contexts. However, for analyzing image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, referring to FIG. 5, the sequence layer 243 is configured to combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, the sequence layer 243 can stack multiple bidirectional LSTMs, resulting in a deep bidirectional LSTM.
[0049] In some embodiments, the sequence layer 243 can use other forms of recurrent neural networks, such as the Gated Recurrent Unit (GRU), which is very similar to LSTM and consists of update and reset gates. In this embodiment, the sequence layer 243 can use a two-direction GRU as a bidirectional GRU (BiGRU) module in order to perform feature encoding. The deep structure achieved from a BiGRU, for example, allows a higher level of abstraction than a shallow structure, and thus can achieve significant performance improvements in the task of text recognition.
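The following is a minimal sketch of such a BiGRU variant of the sequence layer, in which stacked bidirectional GRUs take the place of the bidirectional LSTMs; the unit counts and the feature dimension are assumptions.

```python
# Minimal sketch: a stacked bidirectional GRU (BiGRU) sequence-layer variant.
# Unit counts and feature dimension are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bigru_encoder(feature_dim=1024):
    inputs = layers.Input(shape=(None, feature_dim))   # ordered feature sequence from the CNN
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)   # deep BiGRU stack
    return models.Model(inputs, x)

build_bigru_encoder().summary()
```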
[0050] Next, at the dense layer 244, the features that are encoded by the sequence layer 243 can be used to classify the language of the text captured in the input image. In other words, the dense layer 244 is distinctly configured to recognize the text in the scene and to detect, or otherwise classify, the language of the recognized text. The dense layer 244 acts as a bridge connecting the features output by the neural network to the exact characters in order to determine the specific index in the dictionary of each feature. For example, the multi-language text recognition model 240 can be trained as an English text recognition model. According to this embodiment, the dictionary can be set to:

dict1 = {0: …, 1: 'A', 2: 'B', 3: 'C', …} (eq. 4)
[0051] Accordingly, the dense layer 244 can output a special index sequence representing the features. For instance, a special index sequence generated by the dense layer can be [14, 32, 29, …]. By looking up the dictionary, the index sequence can be translated into a meaningful sentence, such as "How are you?". Thus, the language classification performed by the dense layer 244 can generally be described as deciphering the index sequence resulting from the encoded features of the image.
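As a simple illustration of this deciphering step, the following sketch translates a hypothetical index sequence into text by looking each index up in a toy dictionary. The dictionary contents and the example sequence are hypothetical and far smaller than a real character dictionary.

```python
# Minimal sketch: translating an index sequence produced by the dense layer back into text
# by looking it up in the character dictionary. The dictionary contents are hypothetical.
dict1 = {0: ' ', 1: 'H', 2: 'a', 3: 'e', 4: 'o', 5: 'r', 6: 'w', 7: 'u', 8: 'y', 9: '?'}

def indexes_to_text(index_sequence, dictionary):
    return ''.join(dictionary[i] for i in index_sequence)

# Hypothetical output of the dense layer for one text line.
index_sequence = [1, 4, 6, 0, 2, 5, 3, 0, 8, 4, 7, 9]
print(indexes_to_text(index_sequence, dict1))   # "How are you?"
```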
[0052] In order to support the multi-language capabilities of the multi-language text recognition model 240, dictionaries representing several languages are combined into a single dictionary. After enlarging the dictionary, the order of each character can be set to:

dict2 = {0: …, 1: 'A', …, 50: 'z', 51: …, …, 4500: …, 4501: …, …} (eq. 5)
[0053] In this way, a specific distribution of each language can be determined. In the example of dict2 (assuming that the dictionary includes 10,000 characters), an index interval of 0 to 50 can belong to a first language, such as English, an index interval of 51-4500 can belong to a second language, for instance, Chinese, another index interval of 4501-7500 can belong to a third language, such as Japanese, and the remaining index interval 7501-10,000 can belong to a fourth language, for instance Korean. With this ordered dictionary, the language of the prediction results can easily be determined. Thus, by having a predefined index, where each interval of the index corresponds to the characters of a particular language, a classification for the language can ultimately be determined from the index sequence. For instance, referring back to the example index sequence [14, 32, 29, …], the dense layer 244 can ascertain that the language is English, since all indexes are in the interval between 0 and 50 corresponding to English.
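The following is a minimal sketch of this interval-based lookup, using the example boundaries above (English 0-50, Chinese 51-4500, Japanese 4501-7500, Korean 7501-10,000) as an assumed dictionary layout.

```python
# Minimal sketch: classifying the language of a predicted index from the interval of the
# combined dictionary it falls in. The interval boundaries mirror the example above and are
# assumptions about the final dictionary layout.
LANGUAGE_INTERVALS = [
    ('English', 0, 50),
    ('Chinese', 51, 4500),
    ('Japanese', 4501, 7500),
    ('Korean', 7501, 10000),
]

def language_of_index(index):
    for name, lo, hi in LANGUAGE_INTERVALS:
        if lo <= index <= hi:
            return name
    raise ValueError(f'index {index} outside the combined dictionary')

print([language_of_index(i) for i in [14, 32, 29]])   # ['English', 'English', 'English']
```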
[0054] In cases where an index sequence generated by the dense layer 244 falls within different index intervals of the dictionary, suggesting that the text includes more than one language, a language classification can ultimately be selected using an additional voting feature. For example, the dense layer 244 can be configured to determine a probability for each language indicated by an index sequence, where the probability represents the likelihood, or a vote, that the index sequence is in that particular language. As an example, an index sequence can have the majority of its indexes falling within the interval corresponding to English, and a smaller number of indexes falling within the interval corresponding to Chinese. In this example, the voting feature may generate a higher percentage, or more votes, for English for the index sequence; and a lower percentage, or fewer votes, for Chinese for the index sequence. Thus, according to the voting feature, the dense layer 244 may select English as the language classification for that index sequence (as opposed to Chinese), as it is the language with the highest votes.
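As a non-limiting illustration of the voting feature, the following sketch counts how many predicted indexes fall within each language interval and selects the language with the most votes; the intervals repeat the hypothetical dictionary layout assumed in the previous sketch.

```python
# Minimal sketch of the voting feature: each predicted index casts a vote for the language
# whose dictionary interval it falls in, and the language with the most votes is selected.
# The interval layout is a hypothetical assumption.
from collections import Counter

LANGUAGE_INTERVALS = {'English': range(0, 51), 'Chinese': range(51, 4501),
                      'Japanese': range(4501, 7501), 'Korean': range(7501, 10001)}

def vote_language(index_sequence):
    votes = Counter(next(lang for lang, interval in LANGUAGE_INTERVALS.items()
                         if i in interval)
                    for i in index_sequence)
    return votes.most_common(1)[0][0]

# Mostly English indexes with a couple of Chinese ones: English wins the vote.
print(vote_language([14, 32, 29, 7, 120, 305]))   # 'English' (4 votes vs. 2)
```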
[0055] FIG. 8 is a flowchart illustrating an exemplary OCR process 800 that can implement the autonomous language classification (for multiple languages) and text recognition features, as disclosed herein. FIG. 8 shows process 800 as a series of executable operations stored on a machine-readable storage medium 806 performed by hardware processors 804, which can be the main processor of a computing component 802. For example, the computing component 802 can be a computing device implementing the OCR application described at least in reference to FIG. 1. In operation, hardware processors 804 execute the operations of process 800, thereby implementing the disclosed autonomous language classification (for multiple languages) and text recognition techniques.
[0056] At operation 805, the process 800 begins by extracting features from an image having a pictorial representation of textual content in one or more languages. In some embodiments, the image is a three-channel image that includes both textual content and non-textual content. Operation 805 can involve employing a neural network, such as a CNN, to extract segment-wise visual features corresponding to the textual content of the image. The CNN can recognize objects and patterns in images, and a textual region of the input image can be divided into a plurality of segments by the CNN. Then, the CNN can extract a feature vector for each segment and arrange the feature vectors for the plurality of segments into an ordered feature sequence.
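As a sketch of operation 805 (not the exact network of the disclosure), a small PyTorch-style backbone can turn a three-channel text-line image into an ordered, left-to-right sequence of per-segment feature vectors:

    import torch
    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        """Toy CNN: each column of the final feature map serves as one segment's feature vector."""
        def __init__(self):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, None)),   # collapse height, keep width (segments)
            )

        def forward(self, image):                      # image: (batch, 3, H, W)
            fmap = self.cnn(image)                     # (batch, 256, 1, W/4)
            return fmap.squeeze(2).permute(0, 2, 1)    # (batch, segments, 256), left to right

    features = FeatureExtractor()(torch.rand(1, 3, 32, 128))   # -> shape (1, 32, 256)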
[0057] Subsequently, at operation 810, the process 800 encodes the features corresponding to the textual content of the image. For example, the feature vectors output from the CNN in the previous operation 805 are then fed to a recurrent neural network for feature labeling. A recurrent neural network is designed to interpret spatial context information by receiving the feature vectors corresponding to the segments of the input image and reusing activations of preceding or following segments in the input image to determine a textual recognition output for each segment in the input image. In some embodiments, encoding the features involves mapping the sliced feature vectors containing textual information using a BiLSTM. As such, after the convolutional neural network 504 automatically extracts a feature sequence having feature vectors associated with segments in the input image, the recurrent neural network can predict textual content corresponding to the feature sequence of the input image based on at least the spatial context of each segment in the input image.
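Operation 810 can then be sketched as a BiLSTM over that feature sequence followed by the dense (projection) layer; the layer sizes are illustrative, and the features tensor comes from the sketch above:

    class SequenceEncoder(nn.Module):
        """BiLSTM over per-segment features, followed by a dense layer scoring every dictionary entry."""
        def __init__(self, feature_dim=256, hidden_dim=256, num_classes=8000):
            super().__init__()
            self.bilstm = nn.LSTM(feature_dim, hidden_dim, bidirectional=True, batch_first=True)
            self.dense = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, feature_sequence):              # (batch, segments, feature_dim)
            encoded, _ = self.bilstm(feature_sequence)     # left and right context per segment
            return self.dense(encoded)                     # (batch, segments, num_classes)

    logits = SequenceEncoder()(features)                   # -> shape (1, 32, 8000)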
[0058] Next, the process 800 can continue to operation 815. At operation 815, the encoded features can be matched to an index of characters for multiple languages. For example, the output vector from a dense layer of the multi-language text recognition model may be a vector of length N. Each index in the output vector corresponds to a respective key in a hash map, wherein the corresponding value represents a matching character in a specific language. Each value of the output vector from the dense layer represents a probability that the recognized character is the corresponding value in the hash map. For example, an output vector from the dense layer may be a vector of 8000 elements, defined as:
V_output = [a_1, a_2, a_3, ..., a_8000] (eq. 6)
[0059] A corresponding dictionary is a hash map, defined as:
D = [1: v_1, 2: v_2, 3: v_3, ..., 8000: v_8000] (eq. 7)
[0060] Each of the values in the dictionary (v_1 - v_8000) corresponds to a character in a particular language. For example, v_1 - v_4000 may correspond to 4000 different characters in Chinese, v_4001 - v_6000 may correspond to 2000 different characters in Japanese, v_6001 - v_7000 may correspond to 1000 different characters in Korean, and v_7001 - v_8000 may correspond to 1000 different characters in Latin. As such, the output vector from the dense layer corresponds to a segment in the input image. Each element in the output vector corresponds to a character of a respective language, and a value of the respective element indicates a probability of textual content of the segment in the input image corresponding to the character of the respective language. Consequently, by ultimately matching each segment of the textual content (vis-a-vis the encoded features) to a character in a specific language, the text for multiple languages within the input image can be automatically recognized.
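Under these assumptions, operation 815 can be sketched as a per-segment argmax over the dense-layer output followed by a lookup in the hash map D; build_dictionary() is a hypothetical helper returning the 8,000-entry map of eq. 7:

    import numpy as np

    D = build_dictionary()   # hypothetical helper returning {1: v_1, 2: v_2, ..., 8000: v_8000}

    def recognize_segments(output_vectors):
        """output_vectors: array of shape (num_segments, 8000) of per-character probabilities."""
        keys, characters = [], []
        for probabilities in output_vectors:
            key = int(np.argmax(probabilities)) + 1   # dictionary keys start at 1
            keys.append(key)
            characters.append(D[key])
        return ''.join(characters), keys

The keys returned here can then be reused in operation 820, where the range of the dictionary each key falls in determines the language classification.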
[0061] Thereafter, at operation 820, the language for the textual content of the input image can be autonomously classified. As previously described, the dictionary defines a range of values in the hash map to correspond to the characters in a particular language. Thus, this known correlation between value ranges (set in the dictionary) and known languages can be leveraged to classify a language for the recognized text. That is, the range of values of the dictionary that each element in the output vector falls in can be analyzed to determine a language classification for the text. For example, if each element of the output vector falls within the v_1 - v_4000 range, that indicates that each segment of the text content in the image corresponds to a Chinese character. Accordingly, the recognized text may be classified as Chinese in operation 820. Therefore, the process 800 can achieve both STR and language detection in a manner that is fully autonomous (e.g., frees the user from having to select a specific language model), while realizing increased performance in precision and/or recall metrics.
[0062] FIG. 9 depicts a block diagram of an example computer system 900 in which various of the multi-language text recognition and autonomous language classification features described herein may be implemented. The computer system 900 includes a bus 902 or other communication mechanism for communicating information, and one or more hardware processors 904 coupled with bus 902 for processing information. Hardware processor(s) 904 may be, for example, one or more general purpose microprocessors.

[0063] The computer system 900 also includes a main memory 906, such as a random-access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 902 for storing information and instructions to be executed by processor 904. Main memory 906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 904. Such instructions, when stored in storage media accessible to processor 904, render computer system 900 into a special-purpose machine that is customized to perform the operations specified in the instructions.
[0064] The computer system 900 further includes a read only memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904. A storage device 910, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 902 for storing information and instructions.
[0065] The computer system 900 may be coupled via bus 902 to a display 912, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user. An input device 914, including alphanumeric and other keys, is coupled to bus 902 for communicating information and command selections to processor 904. Another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.
[0066] The computing system 900 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
[0067] In general, the word "component," "engine," "system," "database," "data store," and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts. Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware components may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.
[0068] The computer system 900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor(s) 904 executing one or more sequences of one or more instructions contained in main memory 906. Such instructions may be read into main memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in main memory 906 causes processor(s) 904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
[0069] The term "non-transitory media," and similar terms, as used herein, refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 910. Volatile media includes dynamic memory, such as main memory 906. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.
[0070] Non-transitory media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between non-transitory media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
[0071] The computer system 900 also includes a communication interface 918 coupled to bus 902. Communication interface 918 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. For example, communication interface 918 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN).
Wireless links may also be implemented. In any such implementation, communication interface 918 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
[0072] A network link typically provides data communication through one or more networks to other data devices. For example, a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP). The ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet." Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link and through communication interface 918, which carry the digital data to and from computer system 900, are example forms of transmission media.
[0073] The computer system 900 can send messages and receive data, including program code, through the network(s), network link and communication interface 918. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 918.
[0074] The received code may be executed by processor 904 as it is received, and/or stored in storage device 910, or other non-volatile storage for later execution.
[0075] Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code components executed by one or more computer systems or computer processors comprising computer hardware. The one or more computer systems or computer processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service" (SaaS). The processes and algorithms may be implemented partially or wholly in application-specific circuitry. The various features and processes described above may be used independently of one another or may be combined in various ways. Different combinations and sub-combinations are intended to fall within the scope of this disclosure, and certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate, or may be performed in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The performance of certain of the operations or processes may be distributed among computer systems or computer processors, not only residing within a single machine, but deployed across a number of machines.
[0076] As used herein, a circuit might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit. In implementation, the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality. Where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 900.
[0077] As used herein, the term "or" may be construed in either an inclusive or exclusive sense. Moreover, the description of resources, operations, or structures in the singular shall not be read to exclude the plural. Conditional language, such as, among others, "can," "could," "might," or "may," unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps.
[0078] Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. Adjectives such as "conventional," "traditional," "normal," "standard," "known," and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. The presence of broadening words and phrases such as "one or more," "at least," "but not limited to" or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.

Claims

What is claimed is:
1. A computer-implemented method comprising: extracting visual features corresponding to textual content of an input image, wherein the input image comprises textual content and non-textual content; encoding the extracted features to map each visual feature with a character using a recurrent neural network such that the textual content is recognized; matching the extracted features to an index of characters such that each character corresponds to an index value; and autonomously classifying a language for the recognized text based on the corresponding index values.
2. The computer-implemented method of claim 1, wherein the index of characters comprises a multiple language dictionary having values that each correspond to a character of a language from the multiple languages.
3. The computer-implemented method of claim 2, wherein each range of values of the multiple language dictionary corresponds to a language from the multiple languages.
4. The computer-implemented method of claim 3, wherein the autonomously classifying a language for the recognized text comprises: determining a range of values corresponding to each character, and selecting the language corresponding to the determined range of values as the classified language for the recognized text.
5. The computer-implemented method of claim 3, wherein the extracting the visual features of the input image comprises applying a convolution neural network (CNN).
6. The computer-implemented method of claim 5, wherein the extracting the visual features of the input image comprises: recognizing, by the CNN, objects and patterns in the input image; dividing, by the CNN, one or more regions of the input image including the textual content into a plurality of segments; and extracting, by the CNN, a feature vector for each segment and arranging the feature vectors for the plurality of segments into an ordered feature sequence.
7. The computer-implemented method of claim 6, wherein encoding the extracted features comprises: mapping each vector element of the feature vector to a plurality of characters in the dictionary, wherein each vector element represents a probability value of the segment representing a corresponding character in the dictionary, and the recurrent neural network comprises a Bidirectional Long Short-Term Memory (BiLSTM) network.
8. The computer-implemented method of claim 1, wherein the input image is a three- channel image that includes pictorial representations of the textual content in one or more languages.
9. The computer-implemented method of claim 1, further comprising cropping the input image such that the non-textual content is removed, and the textual content is extracted.
10. The computer-implemented method of claim 1, wherein cropping comprises drawing bounding boxes around one or more areas of the input image detected to include textual content.
11. A computer system, comprising: one or more processors; and a memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform: applying an optical character recognition (OCR) application to an input image to automatically recognize text content from the input image and to autonomously classify a language corresponding to the recognized text.
12. The computer system of claim 11, wherein the OCR application comprises a single multi-language text recognition model.
13. The computer system of claim 12, wherein the multi-language text recognition model comprises: an input layer for receiving the input image; a convolution neural network layer for extracting features from the input image; a sequence layer for encoding the extracted features; and a dense layer for classifying the language corresponding to the recognized text by matching the encoded feature to an index of characters.
14. The computer system of claim 13, wherein the memory stores the index of characters as a multiple language dictionary having values that each correspond to a character of a language from the multiple languages.
15. The computer system of claim 14, wherein the multiple language dictionary is defined as D = [1: v_1, 2: v_2, 3: v_3, ..., 8000: v_8000].
16. The computer system of claim 15, wherein values v_1 - v_4000 correspond to a first language, v_4001 - v_6000 correspond to a second language, v_6001 - v_7000 correspond to a third language, and v_7001 - v_8000 correspond to a fourth language.
17. A computer-readable medium storing instructions that, when executed by a processor, cause the processor to perform: extracting visual features corresponding to textual content of an input image, wherein the input image comprises textual content and non-textual content; encoding the extracted features to map each visual feature with a character using a Bidirectional Long Short-Term Memory (BiLSTM) network such that the textual content is recognized; matching the extracted features to an index of characters such that each character corresponds to an index value; and autonomously classifying a language for the recognized text based on the corresponding index values.
18. The computer-readable medium of claim 17, wherein the index of characters comprises a multiple language dictionary having values that each correspond to a character of a language from the multiple languages.
19. The computer-readable medium of claim 18, wherein each range of values of the multiple language dictionary corresponds to a language from the multiple languages.
20. The computer-readable medium of claim 19, wherein the autonomously classifying a language for the recognized text comprises: determining a range of values corresponding to each character, and selecting the language corresponding to the determined range of values as the classified language for the recognized text.
PCT/US2021/040137 2021-07-01 2021-07-01 Method and system for multi-language text recognition model with autonomous language classification WO2021237227A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/040137 WO2021237227A1 (en) 2021-07-01 2021-07-01 Method and system for multi-language text recognition model with autonomous language classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/040137 WO2021237227A1 (en) 2021-07-01 2021-07-01 Method and system for multi-language text recognition model with autonomous language classification

Publications (1)

Publication Number Publication Date
WO2021237227A1 true WO2021237227A1 (en) 2021-11-25

Family

ID=78707668

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/040137 WO2021237227A1 (en) 2021-07-01 2021-07-01 Method and system for multi-language text recognition model with autonomous language classification

Country Status (1)

Country Link
WO (1) WO2021237227A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200349381A1 (en) * 2019-04-30 2020-11-05 Hulu, LLC Frame Level And Video Level Text Detection In Video
US20200394431A1 (en) * 2019-06-13 2020-12-17 Wipro Limited System and method for machine translation of text
WO2021081562A2 (en) * 2021-01-20 2021-04-29 Innopeak Technology, Inc. Multi-head text recognition model for multi-lingual optical character recognition

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780276A (en) * 2021-09-06 2021-12-10 成都人人互娱科技有限公司 Text detection and identification method and system combined with text classification
CN113780276B (en) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 Text recognition method and system combined with text classification
CN114120046A (en) * 2022-01-25 2022-03-01 武汉理工大学 Lightweight engineering structure crack identification method and system based on phantom convolution
CN116976821A (en) * 2023-08-03 2023-10-31 广东企企通科技有限公司 Enterprise problem feedback information processing method, device, equipment and medium
CN116976821B (en) * 2023-08-03 2024-02-13 广东企企通科技有限公司 Enterprise problem feedback information processing method, device, equipment and medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21809594

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE