CN113226886A - Method and device for controlling vehicle to run and vehicle - Google Patents


Info

Publication number
CN113226886A
Authority
CN
China
Prior art keywords
vehicle
user
slot
value
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180001475.0A
Other languages
Chinese (zh)
Inventor
苏琪
聂为然
许明霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN113226886A publication Critical patent/CN113226886A/en
Pending legal-status Critical Current

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
    • B60W30/18Propelling the vehicle
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2555/00Input parameters relating to exterior conditions, not covered by groups B60W2552/00, B60W2554/00
    • B60W2555/20Ambient conditions, e.g. wind or rain

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Human Computer Interaction (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application provides a method and an apparatus for controlling vehicle travel, and a vehicle. It relates to the fields of artificial intelligence and automatic driving, and can be applied to intelligent vehicles, connected vehicles, and autonomous vehicles. The method comprises the following steps: acquiring a user instruction while the vehicle is in an automatic driving mode; acquiring environmental information around the vehicle; performing multi-modal understanding on the user instruction and the environmental information around the vehicle to determine the driving intention of the user; and generating an automatic driving control command for the vehicle according to the driving intention of the user. According to this technical solution, the user experience during automatic driving can be improved.

Description

Method and device for controlling vehicle to run and vehicle
Technical Field
The present application relates to the field of automatic driving, and more particularly, to a method and an apparatus for controlling vehicle travel, and a vehicle.
Background
Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision making and reasoning, human-computer interaction, recommendation and search, basic AI theory, and the like.
Automatic driving is a mainstream application in the field of artificial intelligence. Automatic driving technology relies on the coordination of computer vision, radar, monitoring devices, global positioning systems, and the like, so that a motor vehicle can drive automatically without active human operation. Autonomous vehicles use various computing systems to help transport users from one location to another. Some autonomous vehicles may require some initial or continuous input from a user (such as a pilot, driver, or passenger). An autonomous vehicle permits the operator to switch from a manual operation mode to an autonomous driving mode or an intermediate mode. Because automatic driving technology does not require a human to drive the motor vehicle, human driving errors can, in theory, be effectively avoided, traffic accidents can be reduced, and road transportation efficiency can be improved. Therefore, automatic driving technology is receiving increasing attention.
At present, an autonomous vehicle drives a user to a preset destination along a planned route, based on that destination and on the surrounding environment of the vehicle obtained through various sensors. However, during actual driving, the user may form temporary intentions, different from the intention to travel to the destination, based on what the user sees around the vehicle, for example: seeing an acquaintance at the roadside and wanting to stop temporarily to greet them; or feeling too close to the vehicle ahead and wanting to increase the following distance. In the conventional automatic driving technology, if the user forms such a temporary intention, the user can only take over control of the vehicle through manual intervention and then carry out the temporary intention personally. Since the vehicle has then been switched to a manual driving mode, the user can no longer enjoy the more relaxed, safer driving experience offered by automatic driving technology. In addition, when the automation level is Level 5 (L5) (according to the definition of driving automation levels by the Society of Automotive Engineers (SAE)), the manual intervention function of the vehicle may be removed entirely, so the driver will be unable to carry out such temporary intentions at all, and the user experience is degraded.
Therefore, how to improve the user experience during automatic driving is an urgent problem to be solved.
Disclosure of Invention
The application provides a method and an apparatus for controlling vehicle travel, and a vehicle, which can improve the user experience during automatic driving.
In a first aspect, a method for controlling vehicle travel is provided. The method for controlling vehicle travel provided by the application may be executed by an electronic device that supports controlling vehicle travel. Here, an electronic device refers to a device that can be abstracted as a computer system. An electronic device that supports controlling vehicle travel may also be referred to as an apparatus for controlling vehicle travel. The apparatus for controlling vehicle travel may be the complete electronic device, or may be a part of the electronic device, for example, a chip related to the vehicle travel control function, such as a system chip, where the system chip is also called a system on chip (SoC). Specifically, the apparatus for controlling vehicle travel may be a terminal device or an in-vehicle apparatus in the vehicle, such as an in-vehicle computer, a vehicle head unit, or a mobile phone, or may be a processor, a system chip, or another type of in-vehicle chip that can be provided in the computer system of the vehicle or of the in-vehicle apparatus.
The method comprises the following steps: acquiring a user instruction while the vehicle is in an automatic driving mode; acquiring environmental information around the vehicle; performing multi-modal understanding on the user instruction and the environmental information around the vehicle to determine the driving intention of the user; and generating an automatic driving control command for the vehicle according to the driving intention of the user.
In the embodiment of the application, in the automatic driving mode of the vehicle, a user instruction and environmental information around the vehicle can be acquired, multi-modal understanding can be performed on the user instruction and the environmental information, and the driving intention of the user can thereby be determined; an automatic driving control command for the vehicle is then generated according to the driving intention of the user. In this way, a temporary driving intention of the user can be executed while the vehicle travels in the automatic driving mode, without the user having to take over control manually, which improves the user experience during automatic driving.
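For illustration only, the steps above can be composed into a simple control loop. The following Python sketch is not part of the application; every function name and signature in it is an assumption introduced for this example.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DrivingIntent:
    name: str    # e.g. "park", "overtake", "decelerate", "follow", "steer"
    slots: dict  # slot name -> slot value

def autodrive_step(
    get_user_instruction: Callable[[], Optional[str]],  # voice / text / mid-air gesture front end
    get_environment_info: Callable[[], dict],           # camera images, lidar, on-board sensors, ...
    understand: Callable[[str, dict], DrivingIntent],   # multi-modal understanding model
    plan: Callable[[DrivingIntent, dict], object],      # driving intention -> control command
    execute: Callable[[object], None],                  # hand the command to the autopilot
) -> None:
    instruction = get_user_instruction()
    if instruction is None:
        return  # no temporary intention; continue along the planned route
    environment = get_environment_info()
    intent = understand(instruction, environment)
    command = plan(intent, environment)
    execute(command)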
With reference to the first aspect, in certain implementations of the first aspect, the driving intent comprises at least one intent, each intent of the at least one intent comprises n slots, and each slot of the n slots comprises a slot name, a slot value, and a slot value classification, where n is an integer greater than or equal to 0.
With reference to the first aspect, in certain implementations of the first aspect, the intent includes: at least one of parking, overtaking, decelerating, following, and steering.
With reference to the first aspect, in certain implementations of the first aspect, the slot name includes: at least one of a parking position, a speed value, an object to be overtaken or followed, and a steering direction.
With reference to the first aspect, in certain implementations of the first aspect, the slot value classification is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, where an enumeration-type slot value indicates that the slot value is a predefined enumerated value, a text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction, and an environment-type slot value indicates that the slot value is an identifier marked in the environmental information according to the content mentioned in the user instruction.
Optionally, the environment-type slot value includes an image-type slot value, where the image can reflect the environment around the vehicle. In that case, the image-type slot value indicates that the slot value is an identifier marked in the image information according to the content mentioned in the user instruction.
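For illustration only, a driving intention structured in this way could be represented with the following small Python data model; the class and field names, and the example values, are assumptions made for this sketch and are not defined by the application.

from dataclasses import dataclass, field
from enum import Enum
from typing import List

class SlotValueClass(Enum):
    ENUM = "enum"        # slot value is a predefined enumerated value
    TEXT = "text"        # slot value is a substring of, or text derived from, the user instruction
    ENVIRONMENT = "env"  # slot value is an identifier marked in the environmental information
                         # (for image-type values, e.g. a region marked in a camera image)

@dataclass
class Slot:
    name: str                      # e.g. "parking_position", "speed_value"
    value: object                  # enumerated value, text, or environment identifier
    value_class: SlotValueClass

@dataclass
class Intent:
    name: str                                         # e.g. "park", "overtake", "follow"
    slots: List[Slot] = field(default_factory=list)   # n >= 0 slots

@dataclass
class DrivingIntention:
    intents: List[Intent]                              # at least one intent

# Example: "pull over next to the person in the red coat"
example = DrivingIntention(intents=[
    Intent(name="park", slots=[
        Slot(name="parking_position",
             value={"image_id": 0, "box": (412, 180, 468, 320)},  # region marked in the image
             value_class=SlotValueClass.ENVIRONMENT),
    ]),
])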
With reference to the first aspect, in certain implementations of the first aspect, generating the automatic driving control instruction for the vehicle according to the driving intention of the user includes: determining whether the driving intention is feasible according to the driving intention, the surrounding environment, and traffic regulations; and if the driving intention is feasible, generating the automatic driving control instruction for the vehicle.
Optionally, if the driving intention is not feasible, prompt information can be generated and sent to the user.
Optionally, the prompt information may include the reason why the driving intention is not feasible.
In the embodiment of the application, after the driving intention is determined, whether the driving intention is feasible is judged according to the driving intention, the surrounding environment, and traffic regulations, and the automatic driving control command for the vehicle is generated only if the driving intention is feasible. This avoids violating traffic regulations or causing other problems when executing the driving intention of the user in the automatic driving mode, and ensures both the user experience and the safety of automatic driving.
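For illustration only, such a feasibility check could be organized as a set of rule functions that are evaluated before planning; the helper names and the example rule below are assumptions made for this sketch, not part of the application.

def check_feasibility(intent, environment, traffic_rules):
    # Each rule is a callable (intent, environment) -> (bool, reason).
    for rule in traffic_rules:
        ok, reason = rule(intent, environment)
        if not ok:
            return False, reason
    return True, ""

def handle_intent(intent, environment, traffic_rules, plan, notify_user):
    feasible, reason = check_feasibility(intent, environment, traffic_rules)
    if feasible:
        return plan(intent, environment)  # generate the automatic driving control command
    notify_user("Cannot execute the request: " + reason)  # prompt information includes the reason
    return None

# Example rule: forbid parking where the environment reports a no-parking zone.
def no_parking_rule(intent, environment):
    if intent.name == "park" and environment.get("no_parking_zone", False):
        return False, "parking is not allowed here"
    return True, ""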
With reference to the first aspect, in certain implementations of the first aspect, the user instruction includes: any one or more of a user voice instruction, a user text instruction, and a user mid-air gesture instruction.
Optionally, if the actually obtained user instruction is a user voice instruction or a user mid-air gesture instruction, in actual operation the user voice instruction or the user mid-air gesture instruction may first be converted into a user text instruction, and multi-modal understanding may then be performed on the text instruction and the surrounding environmental information; alternatively, multi-modal understanding may be performed directly on the user voice instruction or the user mid-air gesture instruction. This is not limited in the application.
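For illustration only, the conversion path could be wired up as below; the recognizer callables are placeholders, and nothing here is mandated by the application.

def normalize_instruction(instruction, speech_to_text, gesture_to_text):
    # Convert a voice or mid-air gesture instruction into a text instruction before
    # multi-modal understanding; text instructions pass through unchanged.
    kind, payload = instruction           # e.g. ("voice", waveform) or ("text", "pull over")
    if kind == "voice":
        return speech_to_text(payload)    # automatic speech recognition
    if kind == "gesture":
        return gesture_to_text(payload)   # map the recognized gesture to a text command
    return payload                        # already a text instruction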
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: sending a photographing activation signal to a photographing device to activate the photographing device to photograph the environmental information around the vehicle; and the acquiring environmental information around the vehicle includes: acquiring the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.
It should be understood that the environmental information captured by the photographing device may also be recorded as image information. It should also be understood that, in actual operation, the acquired environmental information is not limited to the image information captured by the photographing device; environmental information acquired by a lidar, by on-board sensors, and/or through the Internet of Vehicles may also be used. This is not limited in the application.
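For illustration only, gathering environmental information from several optional sources could look like the following sketch; the source objects and their methods are placeholders, not part of the application.

def collect_environment_info(camera=None, lidar=None, sensors=(), v2x=None):
    # Gather environmental information from whichever sources are available.
    info = {}
    if camera is not None:
        info["images"] = camera.capture()      # triggered by the activation signal, or periodic
    if lidar is not None:
        info["point_cloud"] = lidar.scan()
    for sensor in sensors:
        info[sensor.name] = sensor.read()      # on-board sensors (speed, rain, and so on)
    if v2x is not None:
        info["v2x"] = v2x.latest_messages()    # Internet of Vehicles data
    return info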
With reference to the first aspect, in certain implementations of the first aspect, the acquiring environmental information around the vehicle includes: acquiring environmental information around the vehicle periodically captured by the photographing device.
With reference to the first aspect, in certain implementations of the first aspect, the driving intention of the user is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.
In the embodiment of the application, the driving intention of the user can be presented to the user on the AR-HUD or the central control screen, so that the user can judge in time whether the multi-modal understanding result is correct.
In a second aspect, an apparatus for controlling vehicle travel is provided. The apparatus includes an acquisition unit and a processing unit. The acquisition unit is configured to acquire a user instruction in an automatic driving mode of the vehicle, and is further configured to acquire environmental information around the vehicle. The processing unit is configured to perform multi-modal understanding on the user instruction and the environmental information around the vehicle and determine the driving intention of the user, and is further configured to generate an automatic driving control instruction for the vehicle according to the driving intention of the user.
With reference to the second aspect, in certain implementations of the second aspect, the driving intent includes at least one intent, each of the at least one intent includes n slots, each of the n slots includes a slot name, a slot value, and a classification of slot value, n is greater than or equal to 0, and n is an integer.
With reference to the second aspect, in certain implementations of the second aspect, the intent includes: at least one of parking, overtaking, decelerating, following, and steering.
With reference to the second aspect, in some implementations of the second aspect, the slot name includes: at least one of a parking position, a speed value, an object to be overtaken or followed, and a steering direction.
With reference to the second aspect, in certain implementations of the second aspect, the slot value classification is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, where an enumeration-type slot value indicates that the slot value is a predefined enumerated value, a text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction, and an environment-type slot value indicates that the slot value is an identifier marked in the environmental information according to the content mentioned in the user instruction.
With reference to the second aspect, in some implementations of the second aspect, the processing unit is further configured to: judging whether the driving intention is feasible or not according to the driving intention, the surrounding environment and traffic regulations; and if the driving intention is feasible, generating an automatic driving control instruction for the vehicle.
With reference to the second aspect, in some implementations of the second aspect, the user instruction includes: any one or more of a user voice instruction, a user text instruction, and a user mid-air gesture instruction.
With reference to the second aspect, in certain implementations of the second aspect, the apparatus further includes a transmitting unit configured to transmit a photographing activation signal to the photographing device to activate the photographing device to photograph the environmental information around the vehicle; and the acquisition unit is further configured to acquire the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.
With reference to the second aspect, in some implementations of the second aspect, the acquisition unit is further configured to acquire environmental information around the vehicle periodically captured by the photographing device.
With reference to the second aspect, in some implementations of the second aspect, the driving intention of the user is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.
In a third aspect, a training method for a multi-modal processing module is provided, including: acquiring training data, where the training data includes training input data and training target data, the training input data includes a user instruction and environmental information around a vehicle, and the training target data includes a driving intention corresponding to the training input data; and training the multi-modal processing module according to the training input data and the training target data.
With reference to the third aspect, in certain implementations of the third aspect, the driving intent includes at least one intent, each of the at least one intent includes n slots, each of the n slots includes a slot name, a slot value, and a classification of slot values, n is greater than or equal to 0, and n is an integer.
With reference to the third aspect, in certain implementations of the third aspect, the intent includes: at least one of parking, overtaking, decelerating, following, and steering.
With reference to the third aspect, in certain implementations of the third aspect, the slot name includes: at least one of a parking position, a speed value, an object to be overtaken or followed, and a steering direction.
With reference to the third aspect, in certain implementations of the third aspect, the slot value classification is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, where an enumeration-type slot value indicates that the slot value is a predefined enumerated value, a text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction, and an environment-type slot value indicates that the slot value is an identifier marked in the environmental information according to the content mentioned in the user instruction.
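For illustration only, one possible supervised training step for the multi-modal processing module is sketched below in PyTorch. The assumption that the module returns intent logits and per-token slot logits, and the particular loss decomposition, are choices made for this example and are not specified by the application.

import torch

def train_multimodal_module(module, dataloader, epochs=10, lr=1e-4):
    # Training input: (user instruction, environment information);
    # training target: driving intention expressed as an intent label plus slot labels.
    optimizer = torch.optim.Adam(module.parameters(), lr=lr)
    intent_loss_fn = torch.nn.CrossEntropyLoss()
    slot_loss_fn = torch.nn.CrossEntropyLoss()
    module.train()
    for _ in range(epochs):
        for instruction, environment, intent_label, slot_labels in dataloader:
            intent_logits, slot_logits = module(instruction, environment)
            loss = intent_loss_fn(intent_logits, intent_label) \
                 + slot_loss_fn(slot_logits.flatten(0, 1), slot_labels.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return module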
In a fourth aspect, a training device of a multi-modal processing module is provided, which includes an obtaining unit and a processing unit, wherein the obtaining unit is configured to obtain training data, the training data includes training input data and training target data, the training input data includes a user instruction and environmental information around a vehicle, and the training target data includes a driving intention corresponding to the training input data; the processing unit is used for training the multi-mode processing module according to the training input data and the training target data.
With reference to the fourth aspect, in certain implementations of the fourth aspect, the driving intent includes at least one intent, each of the at least one intent includes n slots, each of the n slots includes a slot name, a slot value, and a classification of slot values, n is greater than or equal to 0, and n is an integer.
With reference to the fourth aspect, in certain implementations of the fourth aspect, the intent includes: at least one of parking, overtaking, decelerating, following, and steering.
With reference to the fourth aspect, in some implementations of the fourth aspect, the slot name includes: at least one of a parking position, a speed value, an object to be overtaken or followed, and a steering direction.
With reference to the fourth aspect, in certain implementations of the fourth aspect, the slot value classification is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, where an enumeration-type slot value indicates that the slot value is a predefined enumerated value, a text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction, and an environment-type slot value indicates that the slot value is an identifier marked in the environmental information according to the content mentioned in the user instruction.
In a fifth aspect, there is provided another method of controlling travel of a vehicle, comprising: acquiring a user instruction in an automatic driving mode of the vehicle; acquiring environmental information around the vehicle; determining the driving intention of the user according to the user instruction and the environment information; generating an automatic driving control instruction for the vehicle according to at least the driving intention of the user; and controlling the vehicle to run based on the automatic driving control command.
In the embodiment of the application, in the automatic driving mode of the vehicle, a user instruction and environmental information around the vehicle can be acquired, the driving intention of the user can be determined according to the user instruction and the environmental information, and an automatic driving control command for the vehicle can then be generated according to the driving intention of the user. In this way, a temporary driving intention of the user can be executed while the vehicle travels in the automatic driving mode, without the user having to take over control manually, which improves the user experience during automatic driving.
With reference to the fifth aspect, in some implementations of the fifth aspect, the determining the driving intention of the user according to the user instruction and the environmental information includes: performing multi-modal understanding on the user instruction and the environmental information; and determining the driving intention of the user based on a result of the multi-modal understanding.
With reference to the fifth aspect, in some implementations of the fifth aspect, the user instruction includes: at least one of a user voice instruction, a user text instruction, and a user mid-air gesture instruction.
With reference to the fifth aspect, in certain implementations of the fifth aspect, the driving intent comprises at least one intent, each of the at least one intent comprising n slots, each of the n slots comprising a slot name, a slot value, and a classification of slot value, n being greater than or equal to 0, n being an integer.
With reference to the fifth aspect, in certain implementations of the fifth aspect, the intent includes: at least one of parking, overtaking, decelerating, following, and steering.
With reference to the fifth aspect, in some implementations of the fifth aspect, the slot name includes: at least one of a parking position, a speed value, an object to be overtaken or followed, and a steering direction.
With reference to the fifth aspect, in certain implementations of the fifth aspect, the slot value classification is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, where an enumeration-type slot value indicates that the slot value is a predefined enumerated value, a text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction, and an environment-type slot value indicates that the slot value is an identifier marked in the environmental information according to the content mentioned in the user instruction.
With reference to the fifth aspect, in some implementations of the fifth aspect, the generating an automatic driving control instruction for the vehicle according to the driving intention of the user includes: determining whether the driving intention is feasible according to the driving intention, the surrounding environment, and traffic regulations; and if the driving intention is feasible, generating the automatic driving control instruction for the vehicle.
Optionally, if the driving intention is not feasible, prompt information can be generated and sent to the user.
Optionally, the prompt information may include the reason why the driving intention is not feasible.
In the embodiment of the application, after the driving intention is determined, whether the driving intention is feasible is judged according to the driving intention, the surrounding environment, and traffic regulations, and the automatic driving control command for the vehicle is generated only if the driving intention is feasible. This avoids violating traffic regulations or causing other problems when executing the driving intention of the user in the automatic driving mode, and ensures both the user experience and the safety of automatic driving.
With reference to the fifth aspect, in some implementations of the fifth aspect, if the acquired user instruction is a user text instruction, a natural voice instruction of the user or a mid-air gesture instruction of the user may be acquired first, and the natural voice instruction or the mid-air gesture instruction may then be converted into the text instruction.
With reference to the fifth aspect, in certain implementations of the fifth aspect, the method further comprises: sending a photographing activation signal to a photographing device to activate the photographing device to photograph the environmental information around the vehicle; and the acquiring environmental information around the vehicle includes: acquiring the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.
With reference to the fifth aspect, in certain implementations of the fifth aspect, the acquiring environmental information around the vehicle includes: acquiring environmental information around the vehicle periodically captured by the photographing device.
With reference to the fifth aspect, in some implementations of the fifth aspect, the driving intention of the user is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.
In the embodiment of the application, the driving intention of the user can be presented to the user on the AR-HUD or the central control screen, so that the user can judge in time whether the multi-modal understanding result is correct.
In a sixth aspect, another apparatus for controlling vehicle travel is provided, which includes modules that can implement the method for controlling vehicle travel in the fifth aspect or any possible implementation manner of the fifth aspect.
A seventh aspect provides a processing method of a multi-modal processing module, where the multi-modal processing module is obtained by training according to the training method in the third aspect or any possible implementation manner of the third aspect. The processing method includes: the multi-modal processing module acquires input data, where the input data includes a user instruction and environmental information around a vehicle; and the multi-modal processing module outputs the driving intention according to the input data.
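For illustration only, the processing of the seventh aspect amounts to a single forward pass of a trained model. The PyTorch sketch below shows one common fusion architecture (separate text and image encoders with shared prediction heads); this architecture is an assumption made for the example, not the one claimed by the application.

import torch
import torch.nn as nn

class MultiModalProcessingModule(nn.Module):
    # Encodes the user instruction (token ids) and an image of the surroundings,
    # fuses them, and predicts an intent class plus per-token slot tags.
    def __init__(self, vocab_size, num_intents, num_slot_tags, dim=256):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab_size, dim)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
        self.intent_head = nn.Linear(2 * dim, num_intents)
        self.slot_head = nn.Linear(2 * dim, num_slot_tags)

    def forward(self, token_ids, image):
        text = self.text_encoder(token_ids)           # (batch, seq, dim)
        img = self.image_encoder(image)               # (batch, dim)
        img_seq = img.unsqueeze(1).expand(-1, text.size(1), -1)
        fused = torch.cat([text, img_seq], dim=-1)    # (batch, seq, 2*dim)
        intent_logits = self.intent_head(fused.mean(dim=1))
        slot_logits = self.slot_head(fused)           # per-token slot tagging
        return intent_logits, slot_logits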
In an eighth aspect, a multi-modal processing module is provided, which is obtained by training according to the training method in the third aspect or any possible implementation manner of the third aspect; the multi-modality processing module includes: an acquisition unit configured to acquire input data including a user instruction and environmental information around a vehicle; and the processing unit is used for outputting the driving intention according to the input data.
A ninth aspect provides an autonomous vehicle, including the apparatus in the second aspect or any possible implementation manner of the second aspect; and/or the apparatus in the fourth aspect or any possible implementation manner of the fourth aspect; and/or the apparatus in the sixth aspect or any possible implementation manner of the sixth aspect; and/or the module in the eighth aspect or any possible implementation manner of the eighth aspect.
In a tenth aspect, an apparatus for controlling vehicle travel is provided, including a processor and a memory, where the memory is used to store program instructions, and the processor is used to call the program instructions to execute the method for controlling vehicle travel in the first aspect or any possible implementation manner of the first aspect; and/or to call the program instructions to execute the other method for controlling vehicle travel in the fifth aspect or any possible implementation manner of the fifth aspect.
In an eleventh aspect, there is provided an apparatus for training a multi-modal processing module, including a processor and a memory, where the memory is configured to store program instructions, and the processor is configured to call the program instructions to execute a method for training the multi-modal processing module in the third aspect or any possible implementation manner of the third aspect.
In a twelfth aspect, a system is provided, where the system includes the apparatus in the second aspect or any possible implementation manner of the second aspect; and/or an apparatus according to any of the above sixth aspect or any possible implementation manner of the sixth aspect.
Optionally, the system may be a vehicle, and may also be an on-board system on the vehicle, which is not limited in this application.
In a thirteenth aspect, there is provided a computer program product containing instructions for causing a computer to perform the method for controlling vehicle travel of the first aspect or any of the possible implementations of the first aspect when the computer program product runs on the computer; and/or performing the other method for controlling vehicle travel in the fifth aspect or any possible implementation manner of the fifth aspect.
In a fourteenth aspect, there is provided a computer program product containing instructions for causing a computer to perform the training method of the multi-modal processing module of the third aspect or any possible implementation manner of the third aspect when the computer program product runs on a computer.
A fifteenth aspect provides a computer-readable storage medium storing program code for execution by a device, the program code comprising instructions for performing the method of controlling vehicle travel of the first aspect or any of the possible implementations of the first aspect; and/or performing the other method for controlling vehicle travel in the fifth aspect or any possible implementation manner of the fifth aspect.
In a sixteenth aspect, a computer-readable storage medium is provided, which stores program code for execution by a device, where the program code includes instructions for performing the training method of the multi-modal processing module in the third aspect or any possible implementation manner of the third aspect.
A seventeenth aspect provides a chip, where the chip includes a processor and a data interface, where the processor reads instructions stored in a memory through the data interface, and executes the method for controlling vehicle driving in the first aspect or any possible implementation manner of the first aspect; and/or performing the other method for controlling vehicle travel in the fifth aspect or any possible implementation manner of the fifth aspect.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method for controlling vehicle running in the first aspect or any possible implementation manner of the first aspect; and/or performing the other method for controlling vehicle travel in the fifth aspect or any possible implementation manner of the fifth aspect.
In an eighteenth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to execute the training method of the multi-modal processing module in the third aspect or any possible implementation manner of the third aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute the training method of the multi-modal processing module in the third aspect or any possible implementation manner of the third aspect.
Drawings
FIG. 1 is a functional block diagram of a vehicle provided in an embodiment of the present application;
FIG. 2 is an exemplary diagram of an autopilot system to which embodiments of the present application are applicable;
FIG. 3 is a diagram illustrating an example of an application of a cloud-side command autonomous vehicle according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating an exemplary method for controlling vehicle driving according to an embodiment of the present application;
FIG. 5 is a diagram illustrating an exemplary system architecture provided by an embodiment of the present application;
FIG. 6 is an exemplary diagram of one particular implementation provided by an embodiment of the present application;
FIG. 7 is an exemplary diagram of another specific implementation provided by an embodiment of the present application;
FIG. 8 is an exemplary diagram of a multi-modal processing method provided by an embodiment of the present application;
FIG. 9 is a diagram illustrating another example of a multi-modal processing method provided in an embodiment of the present application;
FIG. 10 is a diagram illustrating an example of a training method for a multi-modal processing module according to an embodiment of the present application;
FIG. 11 is an exemplary diagram of an application scenario provided by an embodiment of the present application;
FIG. 12 is an exemplary diagram of an apparatus for controlling vehicle driving according to an embodiment of the present application;
FIG. 13 is a training apparatus of a multi-modal processing module provided in an embodiment of the present application;
FIG. 14 is a diagram illustrating an exemplary structure of an apparatus according to an embodiment of the present disclosure;
FIG. 15 is an exemplary diagram of a computer program product provided in an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a functional block diagram of a vehicle according to an embodiment of the present application. In one embodiment, the vehicle 100 is configured in a fully or partially autonomous driving mode.
For example, while in the autonomous driving mode, the vehicle 100 may control itself, and may determine, through human operation, the current state of the vehicle and its surrounding environment, determine the possible behavior of at least one other vehicle in the surrounding environment, determine a confidence level corresponding to the possibility that the other vehicle performs the possible behavior, and control the vehicle 100 based on the determined information. While the vehicle 100 is in the autonomous driving mode, the vehicle 100 may be placed into operation without human interaction.
The vehicle 100 may include various subsystems such as a travel system 102, a sensor system 104, a control system 106, one or more peripherals 108, as well as a power supply 110, a computer system 112, and a user interface 116. Alternatively, vehicle 100 may include more or fewer subsystems, and each subsystem may include multiple elements. In addition, each of the sub-systems and elements of the vehicle 100 may be interconnected by wire or wirelessly.
The travel system 102 may include components that provide powered motion for the vehicle 100. In one embodiment, the travel system 102 may include an engine 118, an energy source 119, a transmission 120, and wheels/tires 121. The engine 118 may be an internal combustion engine, an electric motor, an air compression engine, or another type of engine combination, such as a hybrid engine composed of a gasoline engine and an electric motor, or a hybrid engine composed of an internal combustion engine and an air compression engine. The engine 118 converts the energy source 119 into mechanical energy.
Examples of energy sources 119 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electrical power. The energy source 119 may also provide energy to other systems of the vehicle 100.
The transmission 120 may transmit mechanical power from the engine 118 to the wheels 121. The transmission 120 may include a gearbox, a differential, and a drive shaft. In one embodiment, the transmission 120 may also include other devices, such as a clutch. Wherein the drive shaft may comprise one or more shafts that may be coupled to one or more wheels 121.
The sensor system 104 may include a number of sensors that sense information about the environment surrounding the vehicle 100. For example, the sensor system 104 may include a positioning system 122 (which may be a Global Positioning System (GPS), a Beidou system, or another positioning system), an inertial measurement unit (IMU) 124, a radar 126, a laser rangefinder 128, and a camera 130. The sensor system 104 may also include sensors that monitor internal systems of the vehicle 100 (for example, an in-vehicle air quality monitor, a fuel gauge, an oil temperature gauge, etc.). Sensor data from one or more of these sensors may be used to detect objects and their corresponding characteristics (position, shape, orientation, velocity, etc.). Such detection and identification is a critical function for the safe operation of the autonomous vehicle 100.
The positioning system 122 may be used to estimate the geographic location of the vehicle 100. The IMU 124 is used to sense position and orientation changes of the vehicle 100 based on inertial acceleration. In one embodiment, IMU 124 may be a combination of an accelerometer and a gyroscope.
The radar 126 may utilize radio signals to sense objects within the surrounding environment of the vehicle 100. In some embodiments, in addition to sensing objects, radar 126 may also be used to sense the speed and/or heading of an object.
The laser rangefinder 128 may utilize laser light to sense objects in the environment in which the vehicle 100 is located. In some embodiments, the laser rangefinder 128 may include one or more laser sources, laser scanners, and one or more detectors, among other system components.
The camera 130 may be used to capture multiple images of the surrounding environment of the vehicle 100. The camera 130 may be a still camera or a video camera.
The control system 106 is for controlling the operation of the vehicle 100 and its components. The control system 106 may include various elements including a steering system 132, a throttle 134, a braking unit 136, a sensor fusion algorithm 138, a computer vision system 140, a route control system 142, and an obstacle avoidance system 144.
The steering system 132 is operable to adjust the heading of the vehicle 100. For example, in one embodiment, the steering system 132 may be a steering wheel system.
The throttle 134 is used to control the operating speed of the engine 118 and thus the speed of the vehicle 100.
The brake unit 136 is used to control the deceleration of the vehicle 100. The brake unit 136 may use friction to slow the wheel 121. In other embodiments, the brake unit 136 may convert the kinetic energy of the wheel 121 into an electric current. The brake unit 136 may take other forms to slow the rotational speed of the wheels 121 to control the speed of the vehicle 100.
The computer vision system 140 may be operable to process and analyze images captured by the camera 130 in order to identify objects and/or features in the environment surrounding the vehicle 100. The objects and/or features may include traffic signals, road boundaries, and obstacles. The computer vision system 140 may use object recognition algorithms, structure from motion (SFM) algorithms, video tracking, and other computer vision techniques. In some embodiments, the computer vision system 140 may be used to map the environment, track objects, estimate the speed of objects, and so forth.
The route control system 142 is used to determine a travel route of the vehicle 100. In some embodiments, the route control system 142 may combine data from the sensor fusion algorithm 138, the positioning system 122, and one or more predetermined maps to determine the travel route of the vehicle 100.
The obstacle avoidance system 144 is used to identify, evaluate, and avoid or otherwise negotiate potential obstacles in the environment of the vehicle 100.
Of course, in one example, the control system 106 may additionally or alternatively include components other than those shown and described, or some of the components shown above may be omitted.
Vehicle 100 interacts with external sensors, other vehicles, other computer systems, or users through peripherals 108. The peripheral devices 108 may include a wireless communication system 146, an in-vehicle computer 148, a microphone 150, and/or speakers 152.
In some embodiments, the peripheral devices 108 provide a means for a user of the vehicle 100 to interact with the user interface 116. For example, the onboard computer 148 may provide information to a user of the vehicle 100. The user interface 116 may also operate the in-vehicle computer 148 to receive user input. The in-vehicle computer 148 may be operated via a touch screen. In other cases, the peripheral devices 108 may provide a means for the vehicle 100 to communicate with other devices located within the vehicle. For example, the microphone 150 may receive audio (e.g., voice commands or other audio input) from a user of the vehicle 100. Similarly, the speaker 152 may output audio to a user of the vehicle 100.
The wireless communication system 146 may communicate wirelessly with one or more devices, either directly or via a communication network. For example, the wireless communication system 146 may use 3G cellular communication such as code division multiple access (CDMA), global system for mobile communications (GSM), or general packet radio service (GPRS), 4G cellular communication such as long term evolution (LTE), or 5G cellular communication. The wireless communication system 146 may communicate with a wireless local area network (WLAN) using WiFi. In some embodiments, the wireless communication system 146 may communicate directly with a device using an infrared link, Bluetooth, or the like. The wireless communication system 146 may also use other wireless protocols, such as various vehicle communication systems; for example, it may include one or more dedicated short range communications (DSRC) devices, which may include public and/or private data communications between vehicles and/or roadside stations.
The power supply 110 may provide power to various components of the vehicle 100. In one embodiment, power source 110 may be a rechargeable lithium ion or lead acid battery. One or more battery packs of such batteries may be configured as a power source to provide power to various components of the vehicle 100. In some embodiments, the power source 110 and the energy source 119 may be implemented together, such as in some all-electric vehicles.
Some or all of the functionality of the vehicle 100 is controlled by the computer system 112. The computer system 112 may include at least one processor 113, the processor 113 executing instructions 115 stored in a non-transitory computer readable medium, such as the memory 114. The computer system 112 may also be a plurality of computing devices that control individual components or subsystems of the vehicle 100 in a distributed manner.
The processor 113 may be any conventional processor, such as a commercially available CPU. Alternatively, the processor may be a dedicated device such as an ASIC or other hardware-based processor. Although fig. 1 functionally illustrates processors, memories, and other elements of the computer 110 in the same blocks, those of ordinary skill in the art will appreciate that the processors, computers, or memories may actually comprise multiple processors, computers, or memories that may or may not be stored within the same physical housing. For example, the memory may be a hard disk drive or other storage medium located in a different housing than the computer 110. Thus, references to a processor or computer are to be understood as including references to a collection of processors or computers or memories which may or may not operate in parallel. Rather than using a single processor to perform the steps described herein, some components, such as the steering component and the retarding component, may each have their own processor that performs only computations related to the component-specific functions.
In various aspects described herein, the processor may be located remotely from the vehicle and in wireless communication with the vehicle. In other aspects, some of the processes described herein are executed on a processor disposed within the vehicle and others are executed by a remote processor, including taking the steps necessary to perform a single maneuver.
In some embodiments, the memory 114 may include instructions 115 (e.g., program logic), and the instructions 115 may be executed by the processor 113 to perform various functions of the vehicle 100, including those described above. The memory 114 may also contain additional instructions, including instructions to send data to, receive data from, interact with, and/or control one or more of the travel system 102, the sensor system 104, the control system 106, and the peripheral devices 108.
In addition to instructions 115, memory 114 may also store data such as road maps, route information, the location, direction, speed of the vehicle, and other such vehicle data, among other information. Such information may be used by the vehicle 100 and the computer system 112 during operation of the vehicle 100 in autonomous, semi-autonomous, and/or manual modes.
A user interface 116 for providing information to and receiving information from a user of the vehicle 100. Optionally, the user interface 116 may include one or more input/output devices within the collection of peripheral devices 108, such as a wireless communication system 146, an on-board vehicle computer 148, a microphone 150, and a speaker 152.
The computer system 112 may control the functions of the vehicle 100 based on inputs received from various subsystems (e.g., the travel system 102, the sensor system 104, and the control system 106) and from the user interface 116. For example, the computer system 112 may utilize input from the control system 106 in order to control the steering unit 132 to avoid obstacles detected by the sensor system 104 and the obstacle avoidance system 144. In some embodiments, the computer system 112 is operable to provide control over many aspects of the vehicle 100 and its subsystems.
Alternatively, one or more of these components described above may be mounted or associated separately from the vehicle 100. For example, the memory 114 may exist partially or completely separate from the vehicle 100. The above components may be communicatively coupled together in a wired and/or wireless manner.
Optionally, the above components are only an example. In actual applications, components in the above modules may be added or removed according to actual needs, and FIG. 1 should not be construed as limiting the embodiments of the present application.
An autonomous vehicle traveling on a road, such as the vehicle 100 above, may identify objects in its surrounding environment in order to determine an adjustment to its current speed. The objects may be other vehicles, traffic control devices, or other types of objects. In some examples, each identified object may be considered independently, and the characteristics of each object, such as its current speed, acceleration, and distance from the vehicle, may be used to determine the speed to which the autonomous vehicle is to be adjusted.
Optionally, the autonomous vehicle 100, or a computing device associated with the autonomous vehicle 100 (such as the computer system 112, the computer vision system 140, or the memory 114 of FIG. 1), may predict the behavior of the identified objects based on the characteristics of the identified objects and the state of the surrounding environment (for example, traffic, rain, ice on the road, and the like). Optionally, because the behaviors of the identified objects depend on one another, all of the identified objects may also be considered together to predict the behavior of a single identified object. The vehicle 100 can adjust its speed based on the predicted behavior of the identified objects. In other words, the autonomous vehicle can determine, based on the predicted behavior of the objects, which stable state the vehicle needs to adjust to (for example, accelerate, decelerate, or stop). In this process, other factors may also be considered to determine the speed of the vehicle 100, such as the lateral position of the vehicle 100 on the road on which it is traveling, the curvature of the road, the proximity of static and dynamic objects, and so on.
In addition to providing instructions to adjust the speed of the autonomous vehicle, the computing device may also provide instructions to modify the steering angle of the vehicle 100 to cause the autonomous vehicle to follow a given trajectory and/or to maintain a safe lateral and longitudinal distance from objects in the vicinity of the autonomous vehicle (e.g., cars in adjacent lanes on the road).
Optionally, the autonomous vehicle 100 or a computing device associated with the autonomous vehicle 100 (e.g., computer system 112, computer vision system 140, memory 114 of fig. 1) may also predict whether autonomous driving is available in the road ahead based on the state of the vehicle and the detected environmental information, and control the switching between autonomous and manual driving modes.
The vehicle 100 may be a car, a truck, a motorcycle, a bus, a boat, an airplane, a helicopter, a lawn mower, a recreational vehicle, an amusement park vehicle, construction equipment, a trolley, a golf cart, a train, a cart, or the like; this is not particularly limited in the embodiments of the present application.
Fig. 2 is an exemplary diagram of an automatic driving system provided in an embodiment of the present application.
The autopilot system shown in FIG. 2 includes a computer system 101, where the computer system 101 includes a processor 103, and the processor 103 is coupled to a system bus 105. The processor 103 may be one or more processors, and each processor may include one or more processor cores. A display adapter (video adapter) 107 may drive a display 109, and the display 109 is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus 113 through a bus bridge 111. An I/O interface 115 is coupled to the I/O bus. The I/O interface 115 communicates with various I/O devices, such as an input device 117 (for example, a keyboard, a mouse, or a touch screen), a multimedia disk (media tray) 121 (for example, a compact disc read-only memory (CD-ROM) or a multimedia interface), a transceiver 123 (which can send and/or receive radio communication signals), a camera 155 (which can capture static and dynamic digital video images), and an external universal serial bus (USB) interface 125. Optionally, the interface connected to the I/O interface 115 may be a USB interface.
The processor 103 may be any conventional processor, including a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, or a combination thereof. Alternatively, the processor may be a dedicated device such as an Application Specific Integrated Circuit (ASIC). Alternatively, the processor 103 may be a neural network processor or a combination of a neural network processor and a conventional processor as described above.
Optionally, in various embodiments described herein, computer system 101 may be located remotely from the autonomous vehicle and may communicate wirelessly with the autonomous vehicle. In other aspects, some processes described herein are performed on a processor disposed within an autonomous vehicle, others being performed by a remote processor, including taking the actions required to perform a single maneuver.
Computer 101 may communicate with software deploying server 149 via network interface 129. The network interface 129 is a hardware network interface, such as a network card. The network 127 may be an external network, such as the internet, or an internal network, such as an ethernet or Virtual Private Network (VPN). Optionally, the network 127 may also be a wireless network, such as a WiFi network, a cellular network, and the like.
A hard drive interface is coupled to the system bus 105, and the hard drive interface is connected to a hard disk drive. A system memory 135 is coupled to the system bus 105. Data running in the system memory 135 may include the operating system 137 and application programs 143 of the computer 101.
The operating system includes a shell 139 and a kernel 141. The shell 139 is an interface between the user and the kernel of the operating system. The shell is the outermost layer of the operating system and manages the interaction between the user and the operating system: it waits for user input, interprets the user input to the operating system, and processes the output results of the operating system.
The kernel 141 consists of those portions of the operating system that manage memory, files, peripherals, and system resources. Interacting directly with the hardware, the operating system kernel typically runs processes, provides inter-process communication, and handles CPU time-slice management, interrupts, memory management, I/O management, and the like.
The application programs 143 include programs related to controlling the automatic driving of the vehicle, such as a program for managing the interaction of the autonomous vehicle with obstacles on the road, a program for controlling the route or speed of the autonomous vehicle, and a program for controlling the interaction of the autonomous vehicle with other autonomous vehicles on the road. The application programs 143 also reside on the system of the deploying server 149. In one embodiment, the computer system 101 may download the application program 143 from the deploying server 149 when the application program 147 needs to be executed.
For example, the application program 141 may be a program that controls the autonomous vehicle to turn the assisted automatic driving function on or off.
The sensor 153 is associated with the computer system 101. The sensor 153 is used to detect the environment around the computer 101. For example, the sensor 153 may detect an animal, a car, an obstacle, a crosswalk, and the like, and may further detect the environment around such objects, for example: other animals present around the animal, the weather conditions, the brightness of the surrounding environment, and the like. Alternatively, if the computer 101 is located on an autonomous vehicle, the sensor may be a camera, an infrared sensor, a chemical detector, a microphone, or the like.
Computer system 112 in FIG. 1 may also receive information from, or transfer information to, other computer systems. Alternatively, sensor data collected from the sensor system 104 of the vehicle 100 may be transferred to another computer for processing of this data.
For example, as shown in fig. 3, data from the computer system 312 may be transmitted via a network to a server 320 (also referred to as a cloud) on the cloud side for further processing. The networks and intermediate nodes may include various configurations and protocols, including the internet, world wide web, intranets, virtual private networks, wide area networks, local area networks, private networks using one or more company's proprietary communication protocols, ethernet, WiFi, and hypertext transfer protocol (HTTP), as well as various combinations of the foregoing. Such communications may be by any device capable of communicating data to and from other computers, such as modems and wireless interfaces. For example, data such as the state of the vehicle and environmental information are transmitted to the cloud-side server 320 for further processing, and the cloud-side server may recognize and process the data by using various neural network models, and feed back the recognition result to the computer system 312, so that the computer system 312 may confirm whether to turn on or off the auxiliary automatic driving function.
In one example, server 320 may comprise a server having multiple computers, such as a load balancing server farm, that exchange information with different nodes of a network for the purpose of receiving, processing, and transmitting data from computer system 312. The server may be configured similar to computer system 312, with processor 330, memory 340, instructions 350, and data 360.
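As a minimal sketch of the vehicle-to-cloud exchange described above, the following Python snippet packages the vehicle state and environmental information and sends them to the cloud-side server for processing. The endpoint URL, field names, and response format are assumptions for illustration only; any transport capable of carrying the data may be used.

    import json
    import requests  # assumed HTTP client, used here only for illustration

    def upload_vehicle_data(server_url, vehicle_state, environment_info):
        # Package the vehicle state and environmental information and send them to the cloud.
        payload = {"vehicle_state": vehicle_state, "environment": environment_info}
        response = requests.post(server_url, data=json.dumps(payload),
                                 headers={"Content-Type": "application/json"}, timeout=5)
        response.raise_for_status()
        # The cloud side runs its recognition models and returns a result, e.g. whether
        # the assisted autonomous driving function should be turned on or off.
        return response.json()

    # Example (hypothetical endpoint):
    # result = upload_vehicle_data("https://example.com/adas", {"speed_kmh": 30.5}, {"weather": "rain"})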
The autonomous driving system may include several assisted driving functions, such as a pre-collision safety system (PCS), adaptive cruise control (ACC), lane keeping aid (LKA), cross traffic alert (CTA), rear cross traffic alert (RCTA), blind spot warning (BSW), door open warning, and traffic jam assist (TJA).
At present, an autonomous vehicle drives the user to a preset destination along a planned route, based on the destination set in advance and the environment around the vehicle obtained through various sensors. However, during actual travel, the user may form a temporary intention that differs from the intention of traveling to the destination, based on visual information around the vehicle, for example: seeing an acquaintance at the roadside and needing to stop temporarily to greet him; or feeling too close to the car ahead and needing to increase the following distance.
However, in the conventional automatic driving technology, if the user forms such a temporary intention, the user can execute it only by manually taking over control of the vehicle. Since the vehicle has then been switched to the manual driving mode, the user can no longer enjoy the more effortless and safer driving experience brought about by the automated driving technology. In addition, when the automation level is Level 5 (L5) (according to the definition of the Society of Automotive Engineers (SAE) on automation levels), the manual intervention function of the vehicle may be cancelled, and the driver will not be able to execute the above temporary intention at all, so that the user experience is degraded.
Therefore, how to improve the experience feeling of the user in the automatic driving process is an urgent problem to be solved.
In view of the above problems, the present application provides a method for controlling vehicle driving, so that when the user forms a temporary intention while the autonomous vehicle is driving in the autonomous driving mode, the driving intention of the user can be determined by performing multi-modal understanding of the user instruction and the environmental information around the vehicle, and the motion of the vehicle can be controlled according to that driving intention. The temporary intention of the user can thus be executed in the automatic driving mode, which further improves the experience of the user during automatic driving.
Fig. 4 is a diagram illustrating an example of a method for controlling vehicle driving according to an embodiment of the present application. It should be appreciated that the method of FIG. 4 may be implemented in the vehicle of FIG. 1 or the autonomous driving system of FIG. 2. It should be appreciated that the method illustrated in FIG. 4 is performed in an autonomous driving mode.
As shown in fig. 4, the method 400 includes steps S410 to S440, which are described in detail below.
And S410, acquiring a user instruction in an automatic driving mode of the vehicle.
Optionally, the user instruction comprises: any one or more of a user natural voice instruction (i.e., a user voice instruction), a user text instruction, a user air gesture instruction, and the like, which is not limited in the present application.
It should be understood that, during driving in the automatic driving mode of the vehicle, if the user forms a temporary intention (for example, seeing an acquaintance at the roadside and needing to stop temporarily to greet him, or feeling too close to the car ahead and needing to increase the distance), the temporary intention may be input to the relevant in-vehicle device by means of a user instruction. For example, the temporary intention is spoken into the microphone as a natural voice instruction; for another example, the temporary intention is input to the relevant user action acquisition device as an air gesture instruction; for another example, the temporary intention is directly input to the relevant text entry device as a text instruction, which is not limited in this application.
Optionally, if the user instruction obtained in step S410 is limited to a user text instruction, in actual operation the user text instruction may be obtained from the user directly through a related text entry device, or a user voice instruction or an air gesture instruction may be obtained first and then converted into a text instruction by the related device; the manner of obtaining the text instruction is not limited in the present application. For example, if the user forms a temporary intention, the user may speak his or her intention in natural voice to a related in-vehicle device (e.g., a microphone) in the vehicle. The natural voice instruction may then be converted into a text instruction by automatic speech recognition (ASR); in this case the text instruction of the user is obtained from the ASR. Illustratively, an air gesture instruction may be converted into a text instruction by an associated gesture recognition device.
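A minimal sketch of this conversion step is given below; the asr_engine object and its transcribe() method are placeholders standing in for whatever speech recognition component is actually deployed, not a specific product API.

    def get_user_text_instruction(audio_frames, asr_engine):
        # The ASR engine converts the natural voice instruction collected in the cabin
        # into a text instruction; the engine object is a placeholder for illustration.
        text_instruction = asr_engine.transcribe(audio_frames)
        return text_instruction.strip()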
It should be understood that, for convenience of description, in the following embodiments, a text instruction of a user will be taken as an example for description, but it should be understood that this does not constitute a limitation to the present application.
And S420, acquiring environmental information around the vehicle.
It is to be understood that the environmental information around the vehicle may be acquired by a photographing device; specifically, an image or a video is acquired by the photographing device so that the environmental information is reflected by the information in the image or video. The environmental information may also be obtained through a laser radar, vehicle sensors and/or the internet of vehicles, etc., which is not limited in this application. For convenience of description, the scheme is described in this application by taking the case where the photographing device acquires the environmental information as an example.
It should be understood that, in actual operation, the shooting device may acquire video information or image information, or may acquire video information around the vehicle first and then acquire image information from the video, which is not limited in this application. For convenience of description, in the following embodiments, the image information captured by the capturing device is taken as an example for description, but it should be understood that the present application is not limited thereto.
Alternatively, after the user instruction is acquired, a shooting activation signal may be sent to the shooting device to activate the shooting device to shoot image information (i.e., environmental information) around the vehicle. After the photographing device photographs the surrounding image information, the surrounding image information photographed by the photographing device is acquired.
Alternatively, the photographing means may periodically photograph image information around the vehicle. Acquiring the image information around the vehicle at this time may include: acquiring image information around the vehicle periodically captured by a capturing device.
In this case, in order to perform multi-modal understanding as described below, it is necessary to select an appropriate image from image information of the surroundings of the vehicle which is periodically captured, and perform multi-modal understanding.
The appropriate image information may be the image information most recently captured by the photographing device, image information corresponding to a specific time interval estimated from the recognition time of a natural voice instruction, an air gesture instruction, or the like, or image information corresponding to the text instruction. The image information should be selected in combination with the actual situation, which is not limited in the present application.
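The following is a minimal sketch of one way to pick such an image from periodically captured frames, assuming each frame carries a timestamp; the threshold and the fallback to the most recent frame are illustrative assumptions.

    def select_frame(frames, instruction_time, max_offset=1.0):
        # frames: list of (timestamp, image) pairs in capture order.
        # Pick the frame closest in time to the moment the instruction was recognized;
        # fall back to the most recently captured frame if none is within max_offset seconds.
        offset, image = min(((abs(ts - instruction_time), img) for ts, img in frames),
                            key=lambda pair: pair[0])
        if offset <= max_offset:
            return image
        return frames[-1][1]  # most recently captured image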
S430, performing multi-modal understanding of the user instruction and the environmental information around the vehicle, and determining the driving intention of the user. Alternatively,
step S430 may be: determining the driving intention of the user according to the user instruction and the environmental information around the vehicle. This means that the present application does not limit the manner in which the driving intention of the user is determined from the user instruction and the environmental information around the vehicle: the driving intention may be determined by performing multi-modal understanding of the user instruction and the environmental information around the vehicle, or may be determined in other manners. However, as a preferred example, the following description takes the case of performing multi-modal understanding of the user instruction and the environmental information around the vehicle to determine the driving intention of the user.
In the present application, after the user instruction and the environmental information around the vehicle are acquired, multi-modal understanding can be performed to determine the driving intention of the user; that is, step S430 can be performed in a multi-modal processing module (i.e., the multi-modal processing module 540 in fig. 5). The module will be described with reference to fig. 5, and the procedure of the multi-modal processing will be described with reference to fig. 8 and 9, which will not be repeated here.
Optionally, the driving intent includes at least one intent, each intent of the at least one intent includes n slots, each slot of the n slots includes a slot name, a slot value, and a classification of slot values, n is greater than or equal to 0, and n is an integer.
Optionally, the intent may include: at least one of parking, overtaking, decelerating, following, steering, etc. It is to be understood that other intents may be included in the practice, and are not limited thereto.
Optionally, the slot name may include: at least one of a position of parking, a speed value, an object of passing or following, a steering direction, and the like. It should be understood that other slot names may be included in the actual operation, and the present application is not limited thereto.
Optionally, the classification of the slot value may be: an enumeration-type slot value, a text-type slot value, or an environment-type slot value.
An enumeration-type slot value indicates that the slot value is a predefined enumeration value. For example, if the user instruction is "turn right at the next intersection", there is a slot corresponding to the steering direction. Since the steering direction is enumerable (for example, it has only four options: left, right, straight ahead, U-turn), the slot value of the slot "steering direction" is "right", and this slot value can be understood as an enumeration-type slot value.
A text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction; such a slot value is not enumerable. For example, if the user instruction is "stop beside the gas station", there is a slot corresponding to the parking position. Since the parking position is not enumerable, the substring "beside the gas station" in the instruction can be used as the slot value, and this slot value can be understood as a text-type slot value. For another example, if the user instruction is "stop at the luxurious hotel ahead", there is likewise a slot corresponding to the parking position, and the text "luxurious hotel" generated from the instruction may be used as the slot value; this slot value can also be understood as a text-type slot value. It should be understood that the above takes the user text instruction as an example, so the text-type slot value indicates that the slot value may be a substring of the user text instruction or a text generated from the user text instruction; this is the example used in the following embodiments.
An environment-type slot value indicates that the slot value is an identifier made in the environmental information according to what is mentioned in the user instruction. Alternatively, when the environmental information is acquired by a camera, the environmental information may be image information, and the environment-type slot value may also be called an image-type slot value, where the image reflects the environment around the vehicle. Thus, an image-type slot value indicates that the slot value is an identifier made in the image information according to what is mentioned in the user instruction. For example, in the scenario shown in fig. 11, when the user instruction is "drive to the blue car's position and park at the side", there is a slot corresponding to the parking position. Since the parking position is "the blue car's position", the "blue car" can be identified in the image information by a rectangular box (shown in fig. 11), and that rectangular box is the slot value, which can be understood as an image-type slot value. The image-type slot value is used as the example below, which is not a limitation of this application.
It should be understood that the above-mentioned "the travel intention includes at least one intent" means that a travel intention may include one intent or may include a plurality of intents at the same time. For example, when the user instruction is "turn right at the next intersection", one steering intent is included; when the user instruction is "turn right at the next intersection and park", a steering intent and a parking intent are included.
Also mentioned above is that "each of the at least one intent includes n slots, each of the n slots including a slot name, a slot value, and a classification of slot values, n being greater than or equal to 0, n being an integer", meaning that the intent may include one or more slots describing the intent, or may not include slots. If the intent includes a slot describing the intent, then each corresponding slot includes a slot name, a slot value, and a classification of slot values.
For example, when the user instruction is "park", it indicates that the user intends to park, and there is no slot describing the intent in this case; the subsequent operations may be performed directly based on the intent.
For another example, if the user instruction is "stop at the gas station ahead", there are a plurality of slots describing the intent (parking), and the slot name, the slot value, and the classification of the slot value of each slot can be listed according to the user instruction. For example, the slot name, slot value, and slot value classification of the first slot of the parking intent may be the parking position, "the gas station ahead", and the text-type slot value, respectively; the slot name, slot value, and slot value classification of the second slot of the parking intent may be the parking position, the rectangular box (identifying the gas station ahead in the image information), and the image-type slot value, respectively.
Meanwhile, based on the above, it can be seen that the same driving intention may involve one classification of slot values or several classifications of slot values; the specific case needs to be analyzed, and this application does not list all cases exhaustively.
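The intent and slot structure described above could be represented, for example, by the following minimal Python sketch; the class and field names are illustrative assumptions and do not prescribe a particular data format.

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Union

    class SlotValueType(Enum):
        ENUM = "enumeration"         # predefined enumerable value, e.g. a steering direction
        TEXT = "text"                # substring of (or text generated from) the user instruction
        ENVIRONMENT = "environment"  # identifier made in the environmental information, e.g. a bounding box

    @dataclass
    class Slot:
        name: str                    # e.g. "parking position"
        value: Union[str, tuple]     # e.g. "beside the gas station" or a box (x, y, w, h)
        value_type: SlotValueType

    @dataclass
    class Intent:
        name: str                                        # e.g. "park", "overtake", "decelerate", "follow", "steer"
        slots: List[Slot] = field(default_factory=list)  # n >= 0 slots describing the intent

    @dataclass
    class DrivingIntention:
        intents: List[Intent]        # a driving intention includes at least one intent

    # "Stop at the gas station ahead" could then be represented as:
    # DrivingIntention([Intent("park", [Slot("parking position", "the gas station ahead", SlotValueType.TEXT)])])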
Alternatively, the driving intention of the user may be presented to the user through an augmented reality-head up display (AR-HUD), a center screen, or the like, so that the user can judge the correctness of the multi-modal understanding result in time.
For example, when the driving intention includes the environment-like slot position value, the user-mentioned object may be presented on the windshield by the AR-HUD (e.g., a rectangular frame shown in fig. 11 (a)), or may be displayed by a center screen or the like.
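As a minimal sketch of overlaying such an identification on a display image (the AR-HUD rendering pipeline itself is not shown; OpenCV is used here only as an assumed drawing library):

    import cv2  # assumed drawing library, for illustration only

    def draw_slot_identification(image, box, label):
        # Draw the rectangular box of the object mentioned in the user instruction
        # (an image-type slot value) so the user can check the understanding result in time.
        x, y, w, h = box
        cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 0), 2)
        cv2.putText(image, label, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 0, 0), 2)
        return image

    # Example: annotated = draw_slot_identification(frame, (120, 80, 60, 40), "blue car")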
S440, an automatic driving control command for the vehicle is generated according to the driving intention of the user.
Alternatively, an automatic driving control instruction for the vehicle may be generated according to the obtained driving intention, so that the vehicle can be controlled according to the automatic driving control instruction in the automatic driving mode.
During automatic driving, the rules of automatic driving should be observed; that is, the vehicle should drive according to the surrounding environment and must not violate traffic regulations and the like.
Thus, alternatively, whether the driving intention is feasible may first be determined according to the driving intention, the surrounding environment, and traffic regulations; and if the driving intention is feasible, the automatic driving control instruction for the vehicle is then generated. For details, reference may be made to the following description of steps 10 and 11 in fig. 6.
In the embodiment of the application, after the driving intention is determined, whether the driving intention is feasible is judged according to the driving intention, the surrounding environment, and traffic regulations; and only if the driving intention is feasible is the automatic driving control instruction for the vehicle generated. This avoids violating traffic regulations or running into other problems when executing the driving intention of the user in the automatic driving mode, and ensures both the user experience and the safety of automatic driving.
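A minimal sketch of this decision step is shown below. The traffic_rules, surroundings, and planner objects and their methods are placeholder interfaces standing in for the rule base, the environment model, and the trajectory planner; they are assumptions for illustration, not part of this application.

    def generate_control_instruction(driving_intention, surroundings, traffic_rules, planner):
        # Judge whether each intent is feasible under traffic regulations and the current environment.
        for intent in driving_intention.intents:
            if not traffic_rules.allows(intent, surroundings):
                return None, "The driving intention violates traffic regulations: " + intent.name
            if not surroundings.permits(intent):
                return None, "The driving intention is not achievable in the current environment: " + intent.name
        # Only a feasible intention is turned into an automatic driving control instruction.
        return planner.plan(driving_intention, surroundings), None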
Alternatively, if the driving intention is not feasible, a prompt message can be generated and sent to the user. Optionally, the prompting message may further include a reason why the driving intention is not feasible.
Optionally, if the driving intention is feasible, the vehicle may also prompt the user in a voice broadcast manner, such as "parking for you is being performed"; the target path and the target position to be traveled by the vehicle can also be displayed to the user by means of an AR-HUD, a center screen, or the like (e.g., dynamic arrows and boxes shown in fig. 11 (b)).
Optionally, the method 400 may be executed on a cloud server or an edge cloud server, or may be executed in a computer system of a vehicle, which is not limited in this application.
In the embodiment of the application, in an automatic driving mode of a vehicle, the driving intention of a user can be determined by acquiring a user instruction and environmental information around the vehicle and performing multi-mode understanding on the user instruction and the environmental information around the vehicle; and generating an automatic driving control command for the vehicle according to the driving intention of the user. Therefore, when the vehicle runs in the automatic driving mode, the temporary driving intention of the user can be executed, the user does not need to execute the temporary driving intention in a mode of manually taking over the control right, and the experience of the user in the automatic driving process can be improved.
Fig. 5 is a diagram illustrating an exemplary system architecture according to an embodiment of the present application. It should be understood that this system architecture is only an example and should not be construed as limiting the present application. As shown in fig. 5, the system architecture 500 includes: a microphone 510, an automatic speech recognition (ASR) module 520, a camera 530 (i.e., the photographing device), a multi-modal processing module 540, a decision plan calculation module 550, and a vehicle motion control module 560. These modules are described separately below.
Microphone 510: a microphone or microphone group deployed in the vehicle cabin, used to collect the audio information of the user in the cabin, that is, the user voice instruction related to this application, which may also be called the user natural voice instruction.
The ASR module 520: used to recognize the natural voice instruction of the user collected by the microphone 510 and convert it into a text instruction.
The camera 530: the camera or the camera group is deployed on the vehicle and used for collecting image information around the vehicle.
The multi-modal processing module 540: mainly contains a multi-modal intent recognition engine. It is used to receive the text instruction recognized by the ASR module 520 and the image information collected by the camera 530, and to generate the corresponding driving intention according to the text instruction and the image information. In some cases, the multi-modal processing module 540 can also be used to control the camera 530 to acquire image information, as in the implementation shown in fig. 6 below.
Decision plan calculation module 550: used to evaluate the driving intention generated by the multi-modal processing module 540 according to traffic regulations, the surrounding environment, and other conditions, and to determine whether the driving intention is feasible. If adjustment is required, the driving intention is adjusted, and a vehicle control command is generated.
Vehicle motion control module 560: for controlling the vehicle motion in accordance with the vehicle control commands of the decision plan calculation module 550.
It should be understood that the physical deployment of the various components or modules above may be deployed alone or in any combination. It should be understood that in the case of a combined deployment, forwarding of information between the combined modules may not be necessary.
It should be understood that all of the components or modules in the system architecture described above may be deployed in the vehicle; alternatively, some components or modules may be deployed elsewhere, for example: the ASR module 520, the multi-modal processing module 540, and the decision plan calculation module 550 are partially or entirely deployed on a cloud server or an edge cloud server, while the others are deployed on the vehicle, and the scheme of the present application is implemented in a vehicle-cloud interaction manner; this is not limited in the present application.
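A minimal sketch of how the modules of system architecture 500 could be wired together is given below; the module objects and their method names (collect, recognize, capture, understand, evaluate, execute) are illustrative assumptions, not interfaces defined by this application.

    def handle_user_voice_instruction(microphone, asr_module, camera, multimodal_module,
                                      decision_module, motion_controller):
        audio = microphone.collect()                    # collect the natural voice instruction in the cabin
        text_instruction = asr_module.recognize(audio)  # convert it into a text instruction
        image = camera.capture()                        # collect image information around the vehicle
        driving_intention = multimodal_module.understand(text_instruction, image)
        control_command, reason = decision_module.evaluate(driving_intention)
        if control_command is None:
            return reason                               # prompt the user why the intention is not feasible
        motion_controller.execute(control_command)      # control the vehicle motion
        return "executing"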
Based on the system architecture 500, a detailed implementation of the present application will be described in detail below with reference to fig. 6 to 9.
Fig. 6 is an exemplary diagram of a specific implementation manner provided by an embodiment of the present application. As shown in fig. 6, the specific implementation includes steps 1 to 11, which are described in detail below.
Step 1, a user issues a voice instruction.
When a user on the vehicle temporarily makes a new driving intention while the vehicle is automatically driven according to a destination input in advance, the user can speak his or her intention to the microphone 510 in the vehicle in the form of voice.
And 2, sending a natural voice command.
The microphone 510 sends the received natural speech instructions to the ASR module 520.
And 3, voice recognition.
The ASR module 520 performs speech recognition on the received speech instruction and recognizes a text instruction corresponding to the speech instruction.
And 4, transmitting a user text instruction.
The ASR module 520 transmits the recognized text instruction to the multi-modal processing module 540.
And 5, sending a shooting activation signal.
After receiving the text instruction, the multi-modal processing module 540 sends a shooting activation signal to the camera 530, so as to activate the camera 530 to collect surrounding image information.
And 6, shooting image information around the vehicle.
After the camera 530 receives the photographing activation signal, image information around the vehicle is photographed.
And 7, transmitting the image information around the vehicle.
The camera 530 transmits the photographed image information around the vehicle to the multi-modal processing module 540.
And 8, performing multi-modal understanding based on the text instruction and the image information.
The multi-modal processing module 540 performs multi-modal understanding based on the text instruction and the image information, and obtains the driving intention of the user.
It should be understood that the details have been described above with respect to the travel intent and are not described in detail herein. In addition, the process of performing multi-modal understanding with respect to the multi-modal processing module 540 will be described below in conjunction with fig. 8 and 9.
And 9, transmitting the driving intention.
The multi-modal processing module 540 sends the driving intent identified in step 8 to the decision plan calculation module 550.
And step 10, judging whether the intention is feasible or not.
The user's driving intention may not comply with traffic regulations (e.g., the user requests to go backwards on a one-way road or to park at an intersection where parking is prohibited); the user's driving intention may not be achievable in the current surrounding environment; or some other situation may occur such that the user's driving intention cannot be fulfilled.
Therefore, the decision plan calculation module 550 needs to determine whether the driving intention is feasible according to the driving intention in combination with necessary information such as the surrounding environment and traffic regulations, generate prompt information according to the determination result, and notify the user of the prompt information. For example, if the determination result is that the intention is not feasible, the user is informed that the driving intention cannot be executed and may also be informed of the reason. If the determination result is that the intention is feasible, step 11 is executed.
And 11, adjusting the vehicle driving parameters according to the driving intention, the surrounding environment, the traffic regulations and other information.
Specifically, if the determination result in step 10 is feasible, the decision-making plan calculation module 550 determines a specific vehicle motion control command according to the driving intention, the surrounding environment, the traffic regulations and other necessary information, and sends the specific vehicle motion control command to the vehicle motion control module 560. The vehicle motion control module 560 performs specific execution operations according to the vehicle motion control instructions.
It should be understood that after the travel intent is completed, the control instructions for the vehicle motion may be modified as appropriate so that the vehicle continues to travel in the autonomous driving mode to the final destination to which the user is to reach.
Fig. 7 is an exemplary diagram of another specific implementation manner provided by the embodiment of the present application. As shown in fig. 7, this embodiment includes steps 1 to 10, which are described in detail below.
Step 1 to step 4, refer to step 1 to step 4 in the previous implementation (in fig. 6), which is not described herein again.
And 5, periodically shooting the image information around the vehicle.
The camera 530 periodically captures image information around the vehicle.
And 6, transmitting the image information around the vehicle.
The camera 530 periodically transmits the photographed image information around the vehicle to the multi-modal processing module 540.
And 7, performing multi-modal understanding based on the text instruction and the image information.
The multi-modal processing module 540 obtains the driving intention of the user based on multi-modal understanding of the text instruction and the image information at an appropriate time.
The image information at the appropriate time may be the latest image information or image information corresponding to a specific time interval estimated from the recognition time of the natural voice instruction.
Likewise, the driving intent has been described in detail above and will not be described in detail here. In addition, the process of performing multi-modal understanding with respect to the multi-modal processing module 540 will be described below in conjunction with fig. 8 and 9.
Step 8 to step 10, refer to step 9 to step 11 in the previous implementation (in fig. 6), which is not described herein again.
Fig. 8 is an exemplary diagram of a multi-modal processing procedure provided in an embodiment of the present application.
As shown in fig. 8, in the multi-modal processing, the user instruction and the environmental information are input into the multi-modal processing module, the multi-modal processing module performs multi-modal understanding, and the driving intention is finally output.
It should be understood that the multi-modal processing module is obtained through pre-training. Specifically, in the training process, user instructions (such as user voice instructions, user text instructions, or user air gesture instructions), environmental information (such as image information), and the corresponding driving intentions may be used as training data to train the multi-modal processing module, as shown in fig. 10. Therefore, in the application stage of the multi-modal processing module, after a user instruction and environmental information are input, the corresponding driving intention can be output.
Fig. 9 is an exemplary diagram of another multi-modal processing procedure provided in an embodiment of the present application. In fig. 9, a text instruction is used as a user instruction, and image information is used as environment information. It should be understood that fig. 9 is only an example of a structure of the multi-modal processing module shown in fig. 8, and is not intended to limit the present application. It should be understood that, in practice, the structure of the multi-modal processing module may take other forms, and the structure of the multi-modal processing module may be composed of other processing models, networks or modules as long as the driving intention can be output according to the input text command and the image information. The multi-modal processing procedure in this example is described below in conjunction with fig. 9.
As shown in fig. 9, the multi-modal processing module may include a text processing model, a Convolutional Neural Network (CNN), an attention module att.1, and an attention module att.2. The text processing model may be a BERT model commonly used for text processing, or may be other models that can be used for text processing, which is not limited in this application. The CNN network may be, but is not limited to, a Deep residual network (ResNet) or the like.
In this example, the process of the multimodal processing module for travel intent understanding may be as follows:
after the multi-modal processing module obtains the text instruction and the image information, the corresponding text features are extracted from the text instruction through the BERT model, and the corresponding image features are extracted from the image information through the CNN network (e.g., ResNet).
The attention module Att.1 is used to integrate the text features with the image features so as to obtain at least one intent and the n slots corresponding to each intent of the at least one intent, where n is greater than or equal to 0 and n is an integer. Each slot of the n slots includes a slot name, a slot value, and a classification of the slot value, where the classification of the slot value is an enumeration-type slot value, a text-type slot value, or an image-type slot value (see the description of the driving intention under fig. 4).
If the slot value of a slot corresponding to an intent obtained by the attention module Att.1 is classified as an image-type slot value, the image features are integrated with the text features by the attention module Att.2, so as to obtain the slot value of that slot, i.e., the rectangular box of the object mentioned in the user text instruction, for example the rectangular box corresponding to the blue car in fig. 11.
In summary, the information obtained from Att.1 and Att.2 together constitutes the driving intention.
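The following PyTorch sketch mirrors the structure of fig. 9 under stated assumptions: the text and image encoders are passed in as stand-ins for a BERT model and a CNN such as ResNet, and the feature dimension, head counts, and output heads (intent classification, per-token slot classification, bounding-box regression) are illustrative choices, not the specific design of this application.

    import torch
    import torch.nn as nn

    class MultiModalIntentModel(nn.Module):
        def __init__(self, text_encoder, image_encoder, dim=256, num_intents=5, num_slot_types=3):
            super().__init__()
            self.text_encoder = text_encoder     # maps a text instruction to (batch, seq_len, dim)
            self.image_encoder = image_encoder   # maps an image to (batch, regions, dim)
            self.att1 = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # Att.1: text attends to image
            self.att2 = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # Att.2: image attends to text
            self.intent_head = nn.Linear(dim, num_intents)    # e.g. park / overtake / decelerate / follow / steer
            self.slot_head = nn.Linear(dim, num_slot_types)   # enumeration / text / image slot value
            self.box_head = nn.Linear(dim, 4)                 # rectangular box for an image-type slot value

        def forward(self, text_tokens, image):
            text_feat = self.text_encoder(text_tokens)        # text features
            image_feat = self.image_encoder(image)            # image features
            fused_text, _ = self.att1(text_feat, image_feat, image_feat)   # integrate image into text
            intent_logits = self.intent_head(fused_text.mean(dim=1))       # intent classification
            slot_logits = self.slot_head(fused_text)                       # per-token slot classification
            fused_image, _ = self.att2(image_feat, fused_text, fused_text) # integrate text into image
            box = self.box_head(fused_image.mean(dim=1))      # box of the object mentioned in the instruction
            return intent_logits, slot_logits, box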
Fig. 10 is an exemplary diagram of a training method of a multi-modal processing module according to an embodiment of the present application. As shown in fig. 10, the training method 1000 includes steps S1010 and S1020, which are described below.
And S1010, acquiring training data.
The training data includes training input data including a user instruction and environmental information around the vehicle and training target data including a driving intention corresponding to the training input data.
Wherein the driving intent comprises at least one intent, each intent of the at least one intent comprises n slots, each slot of the n slots comprises a slot name, a slot value, and a classification of slot values, n is greater than or equal to 0, and n is an integer.
The intent includes: at least one of parking, overtaking, decelerating, following, steering, etc.
The slot names include: at least one of a position of parking, a speed value, an object of passing or following, a steering direction, and the like.
The classification of the slot value is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value.
The enumeration-type slot value indicates that the slot value is a predefined enumeration value, the text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction, and the environment-type slot value indicates that the slot value is an identifier made in the environmental information according to the content mentioned in the user instruction.
S1020, training the multi-modal processing module according to the training input data and the training target data.
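A minimal training-loop sketch for steps S1010 and S1020 is given below, reusing the model sketch above. The data layout (tokenized instruction, image, intent label, per-token slot labels, box target) and the particular loss terms are assumptions made for illustration only.

    import torch
    import torch.nn.functional as F

    def train_multimodal_module(model, data_loader, epochs=10, lr=1e-4):
        # data_loader is assumed to yield (text_tokens, image, intent_label, slot_labels, box_target),
        # built from the training input data (user instruction + environmental information) and the
        # training target data (the corresponding driving intention).
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for text_tokens, image, intent_label, slot_labels, box_target in data_loader:
                intent_logits, slot_logits, box = model(text_tokens, image)
                loss = (F.cross_entropy(intent_logits, intent_label)
                        + F.cross_entropy(slot_logits.transpose(1, 2), slot_labels)
                        + F.l1_loss(box, box_target))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model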
Fig. 11 is an exemplary diagram of an application scenario provided in an embodiment of the present application. It should be understood that the application scenario shown in fig. 11 is only an example and is not to be construed as limiting the present application. This application scenario is described below in conjunction with fig. 11.
As shown in (a) of fig. 11, the user of the autonomous vehicle temporarily forms a new driving intention while the vehicle is driving in the automatic driving mode toward a preset destination, and gives a natural voice instruction such as "drive to the blue car's position and park at the side" to the vehicle (for example, to a microphone on the vehicle). Subsequently, a related onboard device on the vehicle, such as an ASR module, recognizes the natural voice instruction and converts it into a text instruction. Next, the device or related module on the vehicle for controlling the vehicle to travel determines the temporary intention of the user (i.e., the user needs to stop at the side of the blue car ahead) by the method 400, then generates a suitable vehicle control command according to that temporary driving intention, and issues the command to the vehicle. In addition, the vehicle can also give feedback to the user in the form of a voice broadcast and/or an augmented reality head-up display (AR-HUD). As shown in fig. 11 (b), the vehicle may prompt the user by voice broadcast, such as "parking is being performed for you"; the target path and the target position the vehicle is about to travel can also be displayed to the user by means of the AR-HUD.
It should be understood that the application scenario may also be understood as a user display interface capable of presenting a driving intention to the user, such as a rectangular box shown in fig. 11 (a), and also capable of presenting an upcoming path and a driving target position to the user, such as an arrow and a box shown in fig. 11 (b).
Fig. 12 is an exemplary diagram of an apparatus for controlling vehicle running according to an embodiment of the present application. As shown in fig. 12, the apparatus 1200 includes an acquisition unit 1210 and a processing unit 1220.
In an automatic driving mode of the vehicle, the obtaining unit 1210 is configured to obtain a user instruction.
The acquisition unit 1210 is further configured to acquire environmental information around the vehicle.
The processing unit 1220 is configured to perform multi-modal understanding of the user instruction and the environmental information around the vehicle, and determine the driving intention of the user.
The processing unit 1220 is further configured to generate an automatic driving control instruction for the vehicle according to the driving intention of the user.
Alternatively, the driving intent may include at least one intent, each intent of the at least one intent including n slots, each slot of the n slots including a slot name, a slot value, and a classification of the slot value, n being greater than or equal to 0, n being an integer.
Optionally, the intent may include: at least one of parking, overtaking, decelerating, following, steering, etc.
Optionally, the slot name may include: at least one of a position of parking, a speed value, an object of passing or following, a steering direction, and the like.
Alternatively, the classification of the slot value may be: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, where the enumeration-type slot value indicates that the slot value is a predefined enumeration value, the text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction, and the environment-type slot value indicates that the slot value is an identifier made in the environmental information according to the content mentioned in the user instruction.
Optionally, the processing unit 1220 may be further configured to: judging whether the driving intention is feasible or not according to the driving intention, the surrounding environment and traffic regulations; and if the driving intention is feasible, generating an automatic driving control instruction for the vehicle.
Optionally, the user instruction comprises: any one or more of a user voice instruction, a user text instruction, and a user air gesture instruction.
Optionally, the apparatus 1200 may further include: a transmitting unit 1230, where the transmitting unit 1230 may be configured to transmit a photographing activation signal to the photographing device to activate the photographing device to photograph the environmental information around the vehicle;
the obtaining unit 1210 may further be configured to: the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal is acquired.
Optionally, the obtaining unit 1210 may further be configured to: environmental information around the vehicle periodically captured by the capturing device is acquired.
Alternatively, the user's driving intent may be presented to the user via an augmented reality-heads-up display AR-HUD or a center screen.
Fig. 13 shows a training apparatus for a multi-modal processing module according to an embodiment of the present application. As shown in fig. 13, the apparatus 1300 includes an acquisition unit 1310 and a processing unit 1320.
The obtaining unit 1310 is configured to obtain training data, where the training data includes training input data and training target data, the training input data includes a user instruction and environmental information around the vehicle, and the training target data includes a driving intention corresponding to the training input data.
The processing unit 1320 is configured to train the multi-modal processing module according to the training input data and the training target data.
Alternatively, the driving intent may include at least one intent, each intent of the at least one intent including n slots, each slot of the n slots including a slot name, a slot value, and a classification of slot values, n being greater than or equal to 0, n being an integer.
Optionally, the intent may include: at least one of parking, overtaking, decelerating, following, steering, etc.
Optionally, the slot name may include: at least one of a position of parking, a speed value, an object of passing or following, a steering direction, and the like.
Optionally, the classification of the slot value may be: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, where the enumeration-type slot value indicates that the slot value is a predefined enumeration value, the text-type slot value indicates that the slot value is a substring of the user instruction or a text generated from the user instruction, and the environment-type slot value indicates that the slot value is an identifier made in the environmental information according to the content mentioned in the user instruction.
Fig. 14 is a diagram illustrating a structure of an apparatus according to an embodiment of the present disclosure. The apparatus 1400 comprises a processor 1402, a communication interface 1403, and a memory 1404.
Alternatively, an example of the apparatus 1400 may be a chip. Another example of apparatus 1400 may be a computing device.
The processor 1402, the memory 1404, and the communication interface 1403 may communicate with each other via a bus. The memory 1404 stores executable code, which the processor 1402 reads from the memory 1404 to perform a corresponding method. Other software modules required to execute a process, such as an operating system, may also be included in the memory 1404. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
For example, the executable code in memory 1404 is used to implement the method shown in fig. 4 or fig. 10, and processor 1402 reads the executable code in memory 1404 to perform the method shown in fig. 4 or fig. 10.
The processor 1402 may be a CPU. The memory 1404 may include a volatile memory, such as a random access memory (RAM). The memory 1404 may also include a non-volatile memory (NVM), such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
In some embodiments of the present application, the disclosed methods may be implemented as computer program instructions encoded on a computer-readable storage medium in a machine-readable format or encoded on other non-transitory media or articles of manufacture. Fig. 15 schematically illustrates a conceptual partial view of an example computer program product comprising a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein. In one embodiment, the example computer program product 1500 is provided using a signal bearing medium 1501. The signal bearing medium 1501 may include one or more program instructions 1502 which, when executed by one or more processors, may provide the functions or portions of the functions described above with respect to the methods shown in fig. 4 or 10. Thus, for example, referring to the embodiment shown in fig. 4, one or more features of S410-S440 may be undertaken by one or more instructions associated with the signal bearing medium 1501.
In some examples, signal bearing medium 1501 may include a computer readable medium 1503, such as, but not limited to, a hard disk drive, a Compact Disc (CD), a Digital Video Disc (DVD), a digital tape, a memory, a read-only memory (ROM), a Random Access Memory (RAM), or the like. In some implementations, the signal bearing medium 1501 may include a computer recordable medium 1504 such as, but not limited to, memory, read/write (R/W) CD, R/W DVD, and the like. In some implementations, signal bearing medium 1501 may include a communication medium 1505 such as, but not limited to, digital and/or analog communication media (e.g., fiber optic cables, waveguides, wired communications links, wireless communications links, etc.). Thus, for example, signal bearing medium 1501 may be conveyed by a wireless form of communication medium 1505 (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or other transmission protocol). The one or more program instructions 1502 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, the aforementioned computing devices may be configured to provide various operations, functions, or actions in response to program instructions 1502 communicated to the computing device through one or more of computer-readable media 1503, computer-recordable media 1504, and/or communication media 1505. It should be understood that the arrangements described herein are for illustrative purposes only. Thus, those skilled in the art will appreciate that other arrangements and other elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used instead, and that some elements may be omitted altogether depending upon the desired results. In addition, many of the described elements are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, in any suitable combination and location.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between 2 or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (36)

1. A method of controlling travel of a vehicle, comprising:
acquiring a user instruction in an automatic driving mode of the vehicle;
acquiring environmental information around the vehicle;
performing multi-modal understanding on the user instruction and environmental information around the vehicle, and determining the driving intention of the user;
and generating an automatic driving control instruction for the vehicle according to the driving intention of the user.
2. The method of claim 1, wherein the travel intent comprises at least one intent, each intent of the at least one intent comprising n slots, each slot of the n slots comprising a slot name, a slot value, and a classification of the slot value, n being greater than or equal to 0, n being an integer.
3. The method of claim 2, wherein the intent comprises: at least one of parking, overtaking, decelerating, following and steering.
4. The method of claim 2 or 3, wherein the slot name comprises: at least one of a position of parking, a speed value, an object of passing or following, and a steering orientation.
5. The method of any of claims 2 to 4, wherein the classification of the slot value is: enumerating a class slot position value, a text class slot position value or an environment class slot position value,
the enumeration type slot value indicates that a slot value is a predefined enumeration value, the text type slot value indicates that the slot value is a substring in the user instruction or a text generated according to the user instruction, and the environment type slot value indicates that the slot value is an identifier made in the environment information according to the content mentioned in the user instruction.
6. The method of any one of claims 1 to 5, wherein the generating an automatic driving control instruction for the vehicle according to the driving intention of the user comprises:
determining, according to the driving intention, the surrounding environment, and traffic regulations, whether the driving intention is feasible; and
if the driving intention is feasible, generating the automatic driving control instruction for the vehicle.
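As a rough illustration of the feasibility check in claim 6, the sketch below shows one way an implementation might gate instruction generation on the surrounding environment and traffic regulations. The predicate keys (adjacent_lane_clear, overtaking_allowed, and so on) and the dictionary-based representations are assumptions made for this example only.

```python
def is_feasible(intention, environment, traffic_rules):
    """Return True only if every intent in the intention can be executed safely and legally.

    `intention` is assumed to be a mapping like {"intents": [{"name": "overtake"}, ...]};
    `environment` and `traffic_rules` are assumed to be flat dictionaries of booleans.
    """
    for intent in intention.get("intents", []):
        name = intent.get("name")
        if name == "overtake":
            if not environment.get("adjacent_lane_clear", False):
                return False
            if not traffic_rules.get("overtaking_allowed", True):
                return False
        elif name == "park":
            if not environment.get("parking_spot_available", False):
                return False
    return True


def generate_control_instruction(intention, environment, traffic_rules):
    """Generate an automatic driving control instruction only when the intention is feasible."""
    if not is_feasible(intention, environment, traffic_rules):
        return None  # e.g. keep the current plan or ask the user to confirm or revise
    return {"type": "execute_intention", "intention": intention}


# Example: overtaking is generated only if the adjacent lane is clear and the rules allow it.
instruction = generate_control_instruction(
    {"intents": [{"name": "overtake"}]},
    {"adjacent_lane_clear": True},
    {"overtaking_allowed": True},
)
```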
7. The method of any one of claims 1 to 6, wherein the user instruction comprises: any one or more of a user voice instruction, a user text instruction, and an explicit user gesture instruction.
8. The method of any one of claims 1 to 7, further comprising:
sending a capture activation signal to a capturing device to activate the capturing device to capture the environmental information around the vehicle;
wherein the acquiring environmental information around the vehicle comprises:
acquiring the environmental information around the vehicle captured by the capturing device according to the capture activation signal.
9. The method of any one of claims 1 to 7, wherein the acquiring environmental information around the vehicle comprises:
acquiring environmental information around the vehicle that is periodically captured by a capturing device.
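Claims 8 and 9 describe two alternative ways of obtaining the environmental information: on demand, by sending an activation signal to the capturing device, or by reading frames that the device captures periodically. A minimal sketch of both modes follows; the CapturingDevice interface is invented here as a stand-in for whatever camera API a concrete system exposes.

```python
import time


class CapturingDevice:
    """Stand-in for an on-board camera; the real interface is not specified by the application."""

    def capture(self):
        # A real implementation would return an image or sensor frame.
        return {"timestamp": time.time(), "frame": "<image data>"}


def acquire_on_activation(device):
    """Claim 8: capture is triggered by an explicit activation signal (modelled as a direct call)."""
    return device.capture()


def acquire_latest_periodic(device, frame_buffer):
    """Claim 9: the device captures periodically; the newest buffered frame is read."""
    frame_buffer.append(device.capture())  # in practice filled by a background capture loop
    return frame_buffer[-1]


camera = CapturingDevice()
on_demand_frame = acquire_on_activation(camera)
latest_frame = acquire_latest_periodic(camera, [])
```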
10. The method of any one of claims 1 to 9, wherein the driving intention of the user is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.
11. An apparatus for controlling travel of a vehicle, characterized by comprising an acquisition unit and a processing unit, wherein
the acquisition unit is configured to acquire a user instruction in an automatic driving mode of the vehicle;
the acquisition unit is further configured to acquire environmental information around the vehicle;
the processing unit is configured to perform multi-modal understanding on the user instruction and the environmental information around the vehicle, and determine the driving intention of the user; and
the processing unit is further configured to generate an automatic driving control instruction for the vehicle according to the driving intention of the user.
12. The apparatus of claim 11, wherein the driving intention comprises at least one intent, each intent of the at least one intent comprising n slots, each slot of the n slots comprising a slot name, a slot value, and a classification of the slot value, wherein n is an integer greater than or equal to 0.
13. The apparatus of claim 12, wherein the intent comprises: at least one of parking, overtaking, decelerating, following and steering.
14. The apparatus of claim 12 or 13, wherein the slot name comprises: at least one of a parking position, a speed value, an object to be overtaken or followed, and a steering direction.
15. The apparatus of any one of claims 12 to 14, wherein the classification of the slot value is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, wherein
the enumeration-type slot value indicates that the slot value is a predefined enumeration value, the text-type slot value indicates that the slot value is a substring of the user instruction or a text generated according to the user instruction, and the environment-type slot value indicates that the slot value is an identifier marked in the environment information according to content mentioned in the user instruction.
16. The apparatus of any one of claims 11 to 15, wherein the processing unit is further configured to:
determine, according to the driving intention, the surrounding environment, and traffic regulations, whether the driving intention is feasible; and
if the driving intention is feasible, generate an automatic driving control instruction for the vehicle.
17. The apparatus of any one of claims 11 to 16, wherein the user instruction comprises: any one or more of a user voice instruction, a user text instruction, and an explicit user gesture instruction.
18. The apparatus of any one of claims 11 to 17, further comprising a sending unit, wherein
the sending unit is configured to send a capture activation signal to a capturing device to activate the capturing device to capture the environmental information around the vehicle; and
the acquisition unit is further configured to:
acquire the environmental information around the vehicle captured by the capturing device according to the capture activation signal.
19. The apparatus of any one of claims 11 to 17, wherein the acquisition unit is further configured to:
acquire environmental information around the vehicle that is periodically captured by a capturing device.
20. The apparatus of any one of claims 11 to 19, wherein the driving intention of the user is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.
21. A method of training a multi-modal processing module, comprising:
acquiring training data, wherein the training data comprises training input data and training target data, the training input data comprises a user instruction and environmental information around a vehicle, and the training target data comprises a driving intention corresponding to the training input data;
and training the multi-modal processing module according to the training input data and the training target data.
22. The method of claim 21, wherein the driving intention comprises at least one intent, each intent of the at least one intent comprising n slots, each slot of the n slots comprising a slot name, a slot value, and a classification of the slot value, wherein n is an integer greater than or equal to 0.
23. The method of claim 22, wherein the intent comprises: at least one of parking, overtaking, decelerating, following and steering.
24. The method of claim 22 or 23, wherein the slot name comprises: at least one of a parking position, a speed value, an object to be overtaken or followed, and a steering direction.
25. The method of any one of claims 22 to 24, wherein the classification of the slot value is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, wherein
the enumeration-type slot value indicates that the slot value is a predefined enumeration value, the text-type slot value indicates that the slot value is a substring of the user instruction or a text generated according to the user instruction, and the environment-type slot value indicates that the slot value is an identifier marked in the environment information according to content mentioned in the user instruction.
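To make the training procedure of claims 21 to 25 more concrete, the following is a small, hedged sketch of a supervised training loop. The model architecture, the fixed-size feature encodings, and the reduction of a driving intention to a single intent label are all assumptions made for illustration; PyTorch is used as one common choice, not a requirement of the application.

```python
import torch
import torch.nn as nn

# Assumed toy setup: the user instruction and the surrounding-environment information
# are each already encoded as fixed-size feature vectors, and the driving intention
# is reduced to a single intent class label.
NUM_INTENTS = 5          # e.g. park, overtake, decelerate, follow, steer
TEXT_DIM, ENV_DIM = 64, 128


class MultiModalProcessingModule(nn.Module):
    """Fuses instruction features and environment features into an intent prediction."""

    def __init__(self):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(TEXT_DIM + ENV_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, NUM_INTENTS),
        )

    def forward(self, text_feat, env_feat):
        return self.fusion(torch.cat([text_feat, env_feat], dim=-1))


# Dummy tensors standing in for (user instruction, environment) -> intention training pairs.
text_feats = torch.randn(32, TEXT_DIM)
env_feats = torch.randn(32, ENV_DIM)
target_intents = torch.randint(0, NUM_INTENTS, (32,))

model = MultiModalProcessingModule()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    optimizer.zero_grad()
    logits = model(text_feats, env_feats)       # training input data
    loss = loss_fn(logits, target_intents)      # compared with training target data
    loss.backward()
    optimizer.step()
```

A real system would additionally predict slot values and would train on recorded or simulated driving scenes paired with annotated intentions rather than on random tensors.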
26. A training apparatus for a multi-modal processing module, characterized by comprising an acquisition unit and a processing unit, wherein
the acquisition unit is configured to acquire training data, wherein the training data comprises training input data and training target data, the training input data comprises a user instruction and environmental information around a vehicle, and the training target data comprises a driving intention corresponding to the training input data; and
the processing unit is configured to train the multi-modal processing module according to the training input data and the training target data.
27. The apparatus of claim 26, wherein the driving intention comprises at least one intent, each intent of the at least one intent comprising n slots, each slot of the n slots comprising a slot name, a slot value, and a classification of the slot value, wherein n is an integer greater than or equal to 0.
28. The apparatus of claim 27, wherein the intent comprises: at least one of parking, overtaking, decelerating, following and steering.
29. The apparatus of claim 27 or 28, wherein the slot name comprises: at least one of a parking position, a speed value, an object to be overtaken or followed, and a steering direction.
30. The apparatus of any one of claims 27 to 29, wherein the classification of the slot value is: an enumeration-type slot value, a text-type slot value, or an environment-type slot value, wherein
the enumeration-type slot value indicates that the slot value is a predefined enumeration value, the text-type slot value indicates that the slot value is a substring of the user instruction or a text generated according to the user instruction, and the environment-type slot value indicates that the slot value is an identifier marked in the environment information according to content mentioned in the user instruction.
31. A processing method of a multi-modal processing module, wherein the multi-modal processing module is trained according to the training method of any one of claims 21 to 25, the processing method comprising:
the multi-modal processing module acquires input data, wherein the input data comprises a user instruction and environmental information around a vehicle; and
the multi-modal processing module outputs a driving intention according to the input data.
32. A multi-modal processing module, wherein the multi-modal processing module is trained according to the training method of any one of claims 21 to 25, the multi-modal processing module comprising:
an acquisition unit configured to acquire input data, wherein the input data comprises a user instruction and environmental information around a vehicle; and
a processing unit configured to output a driving intention according to the input data.
33. An apparatus for controlling the travel of a vehicle, comprising a processor and a memory, the memory being configured to store program instructions, the processor being configured to invoke the program instructions to perform a method of controlling the travel of a vehicle as claimed in any one of claims 1 to 10.
34. An apparatus for training a multi-modal processing module, comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform a method of training the multi-modal processing module of any of claims 21 to 25.
35. An autonomous vehicle, characterized by comprising an apparatus for controlling vehicle travel according to any one of claims 11 to 20.
36. A computer-readable storage medium, characterized in that program instructions are stored therein, which, when executed by a processor, implement the method of controlling travel of a vehicle according to any one of claims 1 to 10 and/or the training method of the multi-modal processing module according to any one of claims 21 to 25.
CN202180001475.0A 2021-03-31 2021-03-31 Method and device for controlling vehicle to run and vehicle Pending CN113226886A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/084731 WO2022205211A1 (en) 2021-03-31 2021-03-31 Method and apparatus for controlling vehicle running and vehicle

Publications (1)

Publication Number Publication Date
CN113226886A true CN113226886A (en) 2021-08-06

Family

ID=77081297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180001475.0A Pending CN113226886A (en) 2021-03-31 2021-03-31 Method and device for controlling vehicle to run and vehicle

Country Status (2)

Country Link
CN (1) CN113226886A (en)
WO (1) WO2022205211A1 (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10668925B2 (en) * 2017-09-05 2020-06-02 Baidu Usa Llc Driver intention-based lane assistant system for autonomous driving vehicles

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140365228A1 (en) * 2013-03-15 2014-12-11 Honda Motor Co., Ltd. Interpretation of ambiguous vehicle instructions
US20150062168A1 (en) * 2013-03-15 2015-03-05 Honda Motor Co., Ltd. System and method for providing augmented reality based directions based on verbal and gestural cues
CN110023178A (en) * 2016-12-12 2019-07-16 苹果公司 The autonomous vehicle near destination is instructed using signal of intent
US20190163331A1 (en) * 2017-11-28 2019-05-30 International Business Machines Corporation Multi-Modal Dialog Broker
US20200219491A1 (en) * 2019-01-07 2020-07-09 Nuance Communications, Inc. Contextual utterance resolution in multimodal systems
CN111026873A (en) * 2019-10-24 2020-04-17 中国人民解放军军事科学院国防科技创新研究院 Unmanned vehicle and navigation method and device thereof
CN111008532A (en) * 2019-12-12 2020-04-14 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, Deyi et al.: "Introduction to Artificial Intelligence" (《人工智能导论》), 31 August 2018 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113460092A (en) * 2021-09-01 2021-10-01 国汽智控(北京)科技有限公司 Method, device, equipment, storage medium and product for controlling vehicle
CN114043987A (en) * 2021-10-13 2022-02-15 集度科技有限公司 Instruction processing method, device, terminal and storage medium
CN114171025A (en) * 2021-12-09 2022-03-11 阿维塔科技(重庆)有限公司 Automatic driving method, device, electronic equipment and computer readable storage medium
WO2023115619A1 (en) * 2021-12-23 2023-06-29 深圳创维-Rgb电子有限公司 Vehicle driving method, television and storage medium
CN114475632A (en) * 2022-03-11 2022-05-13 阿波罗智能技术(北京)有限公司 Automatic driving control data determination method, device, equipment and storage medium
CN115457959A (en) * 2022-11-08 2022-12-09 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN115457959B (en) * 2022-11-08 2023-02-10 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Also Published As

Publication number Publication date
WO2022205211A1 (en) 2022-10-06

Similar Documents

Publication Publication Date Title
CN110550029B (en) Obstacle avoiding method and device
CN110379193B (en) Behavior planning method and behavior planning device for automatic driving vehicle
WO2021102955A1 (en) Path planning method for vehicle and path planning apparatus for vehicle
WO2022016457A1 (en) Method and device for controlling switching of vehicle driving mode
WO2022205211A1 (en) Method and apparatus for controlling vehicle running and vehicle
CN110077411B (en) Remote assistance for autonomous vehicles in predetermined situations
CN112230642B (en) Road travelable area reasoning method and device
CN112639793A (en) Test method and device for automatically driving vehicle
CN113968216B (en) Vehicle collision detection method and device and computer readable storage medium
CN112672942B (en) Vehicle lane changing method and related equipment
CN113460042A (en) Vehicle driving behavior recognition method and recognition device
US20230048680A1 (en) Method and apparatus for passing through barrier gate crossbar by vehicle
CN115042821B (en) Vehicle control method, vehicle control device, vehicle and storage medium
CN114248794A (en) Vehicle control method and device and vehicle
CN112585045A (en) Electromechanical braking method and electromechanical braking device
WO2022052872A1 (en) Autonomous driving method and apparatus
WO2022062582A1 (en) Method and apparatus for controlling light supplementing time of camera module
WO2022061702A1 (en) Method, apparatus, and system for driving alerts
CN113859265A (en) Reminding method and device in driving process
CN113968242B (en) Automatic driving scene generation method, device and system
CN113552869B (en) Method for optimizing decision rule control, method for controlling vehicle running and related device
EP4159564A1 (en) Method and device for planning vehicle longitudinal motion parameters
CN115056784A (en) Vehicle control method, device, vehicle, storage medium and chip
CN114616153B (en) Method and control device for preventing collision
WO2022061725A1 (en) Traffic element observation method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination