WO2022222854A1 - Data processing method and related device

Data processing method and related device

Info

Publication number
WO2022222854A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
vectors
data unit
predicted
target
Prior art date
Application number
PCT/CN2022/087028
Other languages
English (en)
French (fr)
Inventor
魏俊秋
廖亿
蒋欣
刘群
钱莉
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP22790962.9A (published as EP4318322A1)
Publication of WO2022222854A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a data processing method and related equipment.
  • Artificial intelligence is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, and acquire and apply knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that responds in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • a language model refers to a model that can predict unknown words in a sentence based on a given partial semantic fragment. For example, given the natural-language fragment "Huawei __ is very good.", the language model can generate the unknown word based on the fragment; here it can generate the word "mobile phone", yielding the sentence "Huawei mobile phone is very good.".
  • the auto-encoding model and the auto-regressive language model are integrated; compared with the auto-encoding model and the auto-regressive language model used alone, this model doubles the number of hidden states.
  • the white part corresponds to the auto-encoding model
  • the gray part corresponds to the auto-regressive language model
  • the latent variables related to the auto-encoding model are used to represent location information
  • the auto-regressive model is used to provide the context information for the prediction made by the auto-encoding language model.
  • its memory consumption is twice that of the auto-encoding and auto-regressive models; there is therefore a need for a language model that consumes less computation and memory.
  • the present application provides a data processing method, characterized in that the method includes:
  • each first embedded vector is used to represent a known data unit in the target data and the first position of that known data unit in the target data.
  • the second embedding vector is used to represent the second position, in the target data, of the first data unit to be predicted; M is a positive integer;
  • the target data is data with missing data, wherein the target data includes non-missing data units (referred to in the embodiments of the present application as known data units) and missing data units (referred to in the embodiments of the present application as to-be-predicted data units, such as the first to-be-predicted data unit and the second to-be-predicted data unit).
  • the known data unit is a data unit in the data that is not missing.
  • the target data may be text data
  • the known data unit in the target data may be a known word or a known character in the text data.
  • the to-be-predicted data unit may be a to-be-predicted word or a to-be-predicted character in the text data.
  • the target data can be voice data
  • the known data unit in the target data can be a known audio sequence in the voice data
  • the data unit to be predicted can be a to-be-predicted audio sequence in the voice data
  • the target data may be image data
  • the known data units in the target data may be known pixels in the image data
  • the data units to be predicted may be pixels to be predicted in the image data.
  • the data granularity of the known data unit and the data unit to be predicted is related to the type of target data
  • the data granularity of the known data unit and the data unit to be predicted may be the smallest data unit in the target data, or a plurality of the smallest data units combined; the granularity of the known data units and the to-be-predicted data units is not limited here.
  • the M first embedded vectors are processed by the target encoder to obtain M first output vectors corresponding to the M known data units, wherein the first output vector corresponding to each known data unit is generated according to the M first embedding vectors;
  • each first output vector is obtained based on M first embedding vectors
  • each first output vector can use the M first embedding vectors as a reference; that is, when generating each first output vector, each first embedding vector is visible, or in other words, each first output vector has a dependency relationship with the M first embedding vectors;
  • the target encoder may be a transformer layer, and the fact that each first output vector is obtained based on the M first embedding vectors can be understood as there being an attention association between any two of the M first embedding vectors.
  • the target encoder may use the M first embedding vectors as inputs; since each first embedding vector already includes both the position information and the data information of a known data unit, there is no need to separately provide an additional M pieces of position information as input to the target encoder.
  • the number of latent variables in the intermediate output of the target encoder also remains the same as the number of input embedding vectors, reducing the computation and memory consumption of the target encoder.
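The mapping from M first embedding vectors to M first output vectors can be sketched as one fully-visible self-attention layer. This is a minimal NumPy illustration under assumptions of my own (single head, random weights, M=4, dimension 8); the patent does not prescribe these sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(X, Wq, Wk, Wv):
    """Fully-visible self-attention: each of the M output vectors is
    generated from all M input embedding vectors (no causal mask)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # M x M attention associations
    return softmax(scores) @ V                # M output vectors

rng = np.random.default_rng(0)
M, d = 4, 8
X = rng.normal(size=(M, d))                   # M first embedding vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = encoder_layer(X, Wq, Wk, Wv)
print(out.shape)  # M first output vectors, same count as the inputs
```

Because the attention is unmasked, perturbing any one input embedding changes every output vector, which is the "each first output vector is generated according to the M first embedding vectors" property.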
  • the first position is used to indicate the relative positional relationship between the known data unit and other known data units, and between the known data unit and the first to-be-predicted data unit;
  • the second position is used to indicate the relative positional relationship between the first data unit to be predicted and each known data unit in the target data.
  • the target encoder is a first transformer layer
  • the target prediction network is a second transformer layer.
  • the first transformer layer includes multiple sub-transformer layers in series; processing the M first embedding vectors by the target encoder to obtain the M first output vectors corresponding to the M known data units includes:
  • for each sub-transformer layer, the data output by the previous sub-transformer layer adjacent to it is processed to obtain M intermediate vectors, and the M intermediate vectors are output to the next sub-transformer layer adjacent to it; wherein, if the sub-transformer layer is the transformer layer closest to the input side among the multiple sub-transformer layers, the input data of the sub-transformer layer are the M first embedding vectors; if the sub-transformer layer is the transformer layer closest to the output side among the multiple sub-transformer layers, the data output by the sub-transformer layer are the M first output vectors.
  • the input of each sub-transformer layer includes M feature vectors corresponding to the M known data units
  • the output of each sub-transformer layer includes M output vectors corresponding to M known data units.
  • the number of latent variables in the intermediate output of the target encoder is also consistent with the number of input embedding vectors, which reduces the computational complexity and memory consumption of the target encoder.
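The series arrangement above, where the latent count stays at M through every depth, can be illustrated with a toy stack. The `tanh` layer here is a hypothetical stand-in for a real sub-transformer layer (which would contain attention, a feed-forward block, residuals, and normalization):

```python
import numpy as np

def sub_layer(X, W):
    # Hypothetical placeholder for one sub-transformer layer.
    return np.tanh(X @ W)

def stacked_encoder(X, weights):
    """Sub-transformer layers in series: the layer closest to the input
    consumes the M first embedding vectors; every layer emits M
    intermediate vectors for its successor; the layer closest to the
    output emits the M first output vectors."""
    shapes = [X.shape]
    for W in weights:
        X = sub_layer(X, W)
        shapes.append(X.shape)   # the latent count stays M at every depth
    return X, shapes

rng = np.random.default_rng(1)
M, d, depth = 4, 8, 3
X0 = rng.normal(size=(M, d))
ws = [rng.normal(size=(d, d)) for _ in range(depth)]
out, shapes = stacked_encoder(X0, ws)
print(shapes)
```

The constant (M, d) shape at every depth is what distinguishes this design from the two-stream prior art, whose hidden-state count is doubled.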
  • the target encoder includes an attention head
  • the processing of the M first embedding vectors by the target encoder includes:
  • the attention information is used to indicate that, when the attention head processes the M first embedding vectors, there is an attention association between any two of the M first embedding vectors;
  • the M first embedding vectors are processed by the target encoder.
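One way to read the attention information is as an M×M boolean mask in which every entry is true. This encoding is an illustrative assumption of mine, not a representation the patent prescribes; the causal mask is shown only for contrast:

```python
import numpy as np

M = 5
# Attention information for the target encoder: every pair of the M first
# embedding vectors has an attention association (fully visible).
full_mask = np.ones((M, M), dtype=bool)

# For contrast, a causal auto-regressive mask hides later positions; the
# encoder described here does NOT apply such a mask to the known units.
causal_mask = np.tril(np.ones((M, M), dtype=bool))

print(int(full_mask.sum()), int(causal_mask.sum()))  # 25 15
```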
  • the method further includes:
  • the M known data units in the target data are embedded through the embedding layer to obtain M third embedding vectors; the embedding layer may be referred to as an input embedding layer.
  • the current input can be M known data units.
  • the embedding layer can perform embedding processing on each known data unit in the current input, and can obtain the embedding vector (that is, the third embedding vector) corresponding to each known data unit;
  • a position vector of each of the M known data units may also be obtained, where the position vector is used to indicate the first position; the first position indicates the position of the known data unit in the target data, and specifically can indicate the relative positional relationship between the known data unit and the other known data units, and between the known data unit and the first to-be-predicted data unit, in the target data;
  • the fusion method can be adding the third embedding vector and the corresponding position vector; alternatively, the first embedded vector can carry the information of a known data unit in the target data and the first position of that known data unit through other operations; the specific fusion method is not limited here.
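The "fusion by addition" option can be sketched directly: look up a third embedding vector per known data unit, then add its position vector. The vocabulary ids, table sizes, and positions below are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d, M = 100, 32, 16, 4      # hypothetical sizes
token_table = rng.normal(size=(vocab_size, d))  # embedding layer
pos_table = rng.normal(size=(max_len, d))       # position vectors

known_ids = [7, 2, 41, 9]   # the M known data units, as vocabulary ids
positions = [0, 1, 3, 4]    # their first positions in the target data
                            # (position 2 is the to-be-predicted unit)

third_vecs = token_table[known_ids]             # M third embedding vectors
first_vecs = third_vecs + pos_table[positions]  # fusion by addition
print(first_vecs.shape)
```

Each resulting row carries both the data information and the position information of one known unit, which is why no separate position stream needs to be fed to the encoder.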
  • the target data further includes a second to-be-predicted data unit, and the prediction sequence of the second to-be-predicted data unit and the first to-be-predicted data unit is determined randomly.
  • the method further includes:
  • the fourth embedded vector is used to represent the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data
  • the fifth embedded vector is used to represent the third position of the second to-be-predicted data unit in the target data
  • the M first embedding vectors and the fourth embedding vector are processed to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit;
  • the M+1 second output vectors and the fifth embedded vector are processed to obtain the second to-be-predicted data unit.
  • a random order is used for prediction, so that the order information of the to-be-predicted data units is fully utilized and explicitly incorporated into the output vectors.
  • the second output vector corresponding to each known data unit is generated according to the M first embedding vectors; the second output vector corresponding to the first data unit to be predicted is generated according to the M first embedding vectors and the fourth embedding vector.
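The randomly-ordered two-step prediction can be sketched as a loop: shuffle the to-be-predicted positions, and after each prediction append its output vector to the context, so the first step conditions on M vectors and the second on M+1. The `predict_fn` below is a dummy stand-in for the target prediction network:

```python
import random

def predict_in_random_order(known_outputs, masked_positions, predict_fn, seed=None):
    """Prediction order is determined randomly; each predicted unit's
    output vector joins the context for the next step."""
    order = list(masked_positions)
    random.Random(seed).shuffle(order)     # random prediction order
    context = list(known_outputs)          # M first output vectors
    predictions = {}
    for pos in order:
        unit, out_vec = predict_fn(context, pos)
        predictions[pos] = unit
        context.append(out_vec)            # M+1, then M+2, ... vectors
    return predictions, len(context)

# Dummy predictor: returns a labelled unit and a placeholder output vector.
dummy = lambda ctx, pos: (f"unit@{pos}", [0.0])
preds, n = predict_in_random_order([[1.0]] * 4, [2, 5], dummy, seed=0)
print(sorted(preds), n)  # both masked positions predicted; context grew to M+2
```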
  • the target data is text data
  • the known data unit is a known word in the text data
  • the first to-be-predicted data unit is a to-be-predicted word in the text data
  • the target data is voice data
  • the known data unit is a known audio sequence in the voice data
  • the first to-be-predicted data unit is a to-be-predicted audio sequence in the voice data
  • the target data is image data
  • the known data unit is a known pixel in the image data
  • the first to-be-predicted data unit is a to-be-predicted pixel in the image data.
  • the present application provides a data processing method, the method comprising:
  • each first embedded vector is used to represent a data unit in the target data and the first position of that data unit in the target data, and the second embedded vector is used to indicate the target processing task;
  • the M is a positive integer;
  • the M first embedding vectors are processed by the target encoder to obtain M output vectors corresponding to the M data units, wherein the output vectors corresponding to each of the data units are based on the M first embedding vectors.
  • processing corresponding to the target processing task is performed on the M output vectors and the second embedding vector to obtain a task processing result.
  • the first position is used to indicate a relative positional relationship between the data unit and other data units.
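The downstream-task flow above can be sketched as: encode the M first embedding vectors into M output vectors, then let the task network combine them with the second embedding vector (the task indicator). Mean-pooling, concatenation, and the stand-in encoder and head are illustrative choices, not the patent's task network:

```python
import numpy as np

def run_task(first_embeds, task_embed, encode_fn, task_head):
    """The target encoder yields M output vectors; the task network then
    performs the processing corresponding to the target processing task
    on those vectors together with the task-indicator embedding."""
    outputs = encode_fn(first_embeds)     # M output vectors
    pooled = outputs.mean(axis=0)         # hypothetical pooling choice
    return task_head(np.concatenate([pooled, task_embed]))

rng = np.random.default_rng(0)
M, d = 4, 8
embeds = rng.normal(size=(M, d))          # M first embedding vectors
task_vec = rng.normal(size=d)             # second embedding vector
identity_encoder = lambda X: X            # stand-in for the target encoder
classify = lambda v: int(v.sum() > 0)     # stand-in binary task head
label = run_task(embeds, task_vec, identity_encoder, classify)
print(label)
```

Swapping `task_vec` and `task_head` is enough to retarget the same encoder to a different task (e.g. text classification vs. similarity matching), which is the point of encoding the task as an input vector.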
  • the target encoder is a first transformer layer
  • the task network is a second transformer layer
  • the first transformer layer includes multiple sub-transformer layers in series; processing the M first embedding vectors by the target encoder to obtain the M output vectors corresponding to the M data units includes:
  • for each sub-transformer layer, the data output by the previous sub-transformer layer adjacent to it is processed to obtain M intermediate vectors, and the M intermediate vectors are output to the next sub-transformer layer adjacent to it; wherein, if the sub-transformer layer is the transformer layer closest to the input side among the multiple sub-transformer layers, the input data of the sub-transformer layer are the M first embedding vectors; if the sub-transformer layer is the transformer layer closest to the output side among the multiple sub-transformer layers, the data output by the sub-transformer layer are the M output vectors.
  • the target encoder includes an attention head
  • the processing of the M first embedding vectors by the target encoder includes:
  • the attention information is used to indicate that, when the attention head processes the M first embedding vectors, there is an attention association between any two of the M first embedding vectors;
  • the M first embedding vectors are processed by the target encoder.
  • the target data is text data
  • the data unit is a word in the text data
  • the target data is voice data
  • the data unit is an audio sequence in the voice data
  • the target data is image data
  • the data units are pixels in the image data.
  • the target processing task includes short text classification, long text classification, natural language inference, text similarity matching or text sentiment classification.
  • the application provides a data processing method, the method comprising:
  • each first embedding vector is used to represent a known data unit in the target data and the first position of that known data unit in the target data
  • the second embedding vector is used to represent the second position, in the target data, of the first data unit to be predicted
  • M is a positive integer
  • the M first embedded vectors are processed to obtain M first output vectors corresponding to the M known data units, wherein the first output vector corresponding to each known data unit is generated according to the M first embedding vectors;
  • the first encoder and the first prediction network are updated to obtain a target encoder and a target prediction network.
  • the first position is used to indicate the relative positional relationship between the known data unit and other known data units, and between the known data unit and the first to-be-predicted data unit;
  • the second position is used to indicate the relative positional relationship between the first data unit to be predicted and each known data unit in the target data.
  • the first encoder is a first transformer layer
  • the first prediction network is a second transformer layer.
  • the first transformer layer includes a plurality of sub-transformer layers in series; processing the M first embedded vectors by the first encoder to obtain the M first output vectors corresponding to the M known data units includes:
  • for each sub-transformer layer, the data output by the previous sub-transformer layer adjacent to it is processed to obtain M intermediate vectors, and the M intermediate vectors are output to the next sub-transformer layer adjacent to it; wherein, if the sub-transformer layer is the transformer layer closest to the input side among the multiple sub-transformer layers, the input data of the sub-transformer layer are the M first embedding vectors; if the sub-transformer layer is the transformer layer closest to the output side among the multiple sub-transformer layers, the data output by the sub-transformer layer are the M first output vectors.
  • the target data further includes a second to-be-predicted data unit, and the prediction sequence of the second to-be-predicted data unit and the first to-be-predicted data unit is determined randomly.
  • the method further includes:
  • the fourth embedded vector is used to represent the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data
  • the fifth embedded vector is used to represent the third position of the second data unit to be predicted in the target data
  • the M first embedding vectors and the fourth embedding vector are processed to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit;
  • the M+1 second output vectors and the fifth embedded vector are processed to obtain a fourth predicted data unit corresponding to the second to-be-predicted data unit;
  • the first encoder and the first prediction network are updated based on the difference between the third predicted data unit and the first to-be-predicted data unit to obtain a target encoder and a target prediction network, including:
  • the second output vector corresponding to each known data unit is generated according to the M first embedding vectors; the second output vector corresponding to the first data unit to be predicted is generated according to the M first embedding vectors and the fourth embedding vector.
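The "update based on the difference between the predicted and the to-be-predicted data unit" step can be sketched as a cross-entropy loss followed by one gradient-descent step. As a simplification (my assumption, not the patent's procedure), the logits themselves stand in for the parameters of the first encoder and first prediction network:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_step(logits, target_id, lr=0.1):
    """One training step: the softmax over logits is the predicted data
    unit, target_id is the first to-be-predicted data unit, and the
    cross-entropy gradient drives a single SGD update."""
    probs = softmax(logits)
    loss = -np.log(probs[target_id])
    grad = probs.copy()
    grad[target_id] -= 1.0           # dL/dlogits for cross-entropy
    return loss, logits - lr * grad  # updated "parameters"

logits = np.array([2.0, 0.0, 0.0])   # model currently prefers the wrong unit
loss0, logits = update_step(logits, target_id=1)
loss1, _ = update_step(logits, target_id=1)
print(loss0 > loss1)  # the loss decreases after one update
```

Repeating this step over many masked positions is what turns the first encoder and first prediction network into the target encoder and target prediction network.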
  • the target data is text data
  • the known data unit is a known word in the text data
  • the first to-be-predicted data unit is a to-be-predicted word in the text data
  • the target data is voice data
  • the known data unit is a known audio sequence in the voice data
  • the first to-be-predicted data unit is a to-be-predicted audio sequence in the voice data
  • the target data is image data
  • the known data unit is a known pixel in the image data
  • the first to-be-predicted data unit is a to-be-predicted pixel in the image data.
  • the present application provides a data processing device, comprising:
  • the obtaining module is used to obtain M first embedded vectors and a second embedded vector; wherein each first embedded vector is used to represent a known data unit in the target data and the first position of that known data unit in the target data, and the second embedding vector is used to represent the second position, in the target data, of the first data unit to be predicted; M is a positive integer;
  • the encoding module is configured to process the M first embedded vectors through the target encoder to obtain M first output vectors corresponding to the M known data units, wherein the first output vector corresponding to each known data unit is generated according to the M first embedded vectors;
  • the prediction module is configured to process the M first output vectors and the second embedded vectors through a target prediction network to obtain the first data unit to be predicted.
  • the first position is used to indicate the relative positional relationship between the known data unit and other known data units, and between the known data unit and the first to-be-predicted data unit;
  • the second position is used to indicate the relative positional relationship between the first data unit to be predicted and each known data unit in the target data.
  • the target encoder is a first transformer layer
  • the target prediction network is a second transformer layer.
  • the first transformer layer includes multiple sub-transformer layers in series; processing the M first embedding vectors by the target encoder to obtain the M first output vectors corresponding to the M known data units includes:
  • for each sub-transformer layer, the data output by the previous sub-transformer layer adjacent to it is processed to obtain M intermediate vectors, and the M intermediate vectors are output to the next sub-transformer layer adjacent to it; wherein, if the sub-transformer layer is the transformer layer closest to the input side among the multiple sub-transformer layers, the input data of the sub-transformer layer are the M first embedding vectors; if the sub-transformer layer is the transformer layer closest to the output side among the multiple sub-transformer layers, the data output by the sub-transformer layer are the M first output vectors.
  • the target encoder includes an attention head
  • the encoding module is configured to obtain attention information, where the attention information is used to indicate that, when the attention head processes the M first embedding vectors, there is an attention association between any two of the M first embedding vectors;
  • the M first embedding vectors are processed by the target encoder according to the attention information.
  • the apparatus further includes:
  • an embedding module configured to perform embedding processing on the M known data units in the target data through the embedding layer to obtain M third embedding vectors;
  • the target data further includes a second to-be-predicted data unit, and the prediction sequence of the second to-be-predicted data unit and the first to-be-predicted data unit is determined randomly.
  • the method further includes:
  • the fourth embedded vector is used to represent the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data
  • the fifth embedded vector is used to represent the third position of the second to-be-predicted data unit in the target data
  • the M first embedding vectors and the fourth embedding vector are processed to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit;
  • the M+1 second output vectors and the fifth embedded vector are processed to obtain the second to-be-predicted data unit.
  • the second output vector corresponding to each known data unit is generated according to the M first embedding vectors; the second output vector corresponding to the first data unit to be predicted is generated according to the M first embedding vectors and the fourth embedding vector.
  • the target data is text data
  • the known data unit is a known word in the text data
  • the first to-be-predicted data unit is a to-be-predicted word in the text data
  • the target data is voice data
  • the known data unit is a known audio sequence in the voice data
  • the first to-be-predicted data unit is a to-be-predicted audio sequence in the voice data
  • the target data is image data
  • the known data unit is a known pixel in the image data
  • the first to-be-predicted data unit is a to-be-predicted pixel in the image data.
  • the present application provides a data processing device, comprising:
  • the obtaining module is used to obtain M first embedded vectors and a second embedded vector; wherein each first embedded vector is used to represent a data unit in the target data and the first position of that data unit in the target data, and the second embedding vector is used to indicate the target processing task;
  • the M is a positive integer;
  • the encoding module is configured to process the M first embedded vectors through the target encoder to obtain M output vectors corresponding to the M data units, wherein the output vector corresponding to each data unit is generated according to the M first embedding vectors;
  • a task processing module configured to perform processing corresponding to the target processing task on the M output vectors and the second embedding vector through a task network to obtain a task processing result.
  • the first position is used to indicate a relative positional relationship between the data unit and other data units.
  • the target encoder is a first transformer layer
  • the task network is a second transformer layer
  • the first transformer layer includes multiple sub-transformer layers in series; processing the M first embedding vectors by the target encoder to obtain the M output vectors corresponding to the M data units includes:
  • for each sub-transformer layer, the data output by the previous sub-transformer layer adjacent to it is processed to obtain M intermediate vectors, and the M intermediate vectors are output to the next sub-transformer layer adjacent to it; wherein, if the sub-transformer layer is the transformer layer closest to the input side among the multiple sub-transformer layers, the input data of the sub-transformer layer are the M first embedding vectors; if the sub-transformer layer is the transformer layer closest to the output side among the multiple sub-transformer layers, the data output by the sub-transformer layer are the M output vectors.
  • the target encoder includes an attention head
  • the encoding module is configured to obtain attention information, where the attention information is used to indicate that, when the attention head processes the M first embedding vectors, there is an attention association between any two of the M first embedding vectors;
  • the M first embedding vectors are processed by the target encoder according to the attention information.
  • the target data is text data
  • the data unit is a word in the text data
  • the target data is voice data
  • the data unit is an audio sequence in the voice data
  • the target data is image data
  • the data units are pixels in the image data.
  • the target processing task includes short text classification, long text classification, natural language inference, text similarity matching or text sentiment classification.
  • the present application provides a data processing device, comprising:
  • an acquisition module for acquiring a first encoder, a first prediction network, M first embedding vectors, and a second embedding vector; wherein each first embedding vector is used to represent a known data unit in the target data and the first position of that known data unit in the target data, and the second embedding vector is used to represent the second position, in the target data, of the first to-be-predicted data unit;
  • the M is a positive integer
  • an encoding module configured to process the M first embedded vectors through the first encoder to obtain M first output vectors corresponding to the M known data units, wherein the first output vector corresponding to each known data unit is generated according to the M first embedding vectors;
  • a prediction module configured to process the M first output vectors and the second embedded vectors through the first prediction network to obtain a third predicted data unit;
  • a model training module for updating the first encoder and the first prediction network based on the difference between the third predicted data unit and the first to-be-predicted data unit to obtain a target encoder and a target prediction network.
• the first position is used to indicate a relative positional relationship between the known data unit and other known data units and between the known data unit and the first to-be-predicted data unit;
  • the second position is used to indicate the relative positional relationship between the first data unit to be predicted and each known data unit in the target data.
• the first encoder is a first transformer layer
  • the first prediction network is a second transformer layer.
• the first transformer layer includes a plurality of sub-transformer layers in series; the M first embedding vectors are processed by the first encoder to obtain M first output vectors corresponding to the M known data units, including:
• for each sub-transformer layer, the data output by the previous adjacent sub-transformer layer is processed to obtain M intermediate vectors, and the M intermediate vectors are output to the next adjacent sub-transformer layer; wherein, if the sub-transformer layer is the transformer layer closest to the input side among the multiple sub-transformer layers, the input data of the sub-transformer layer is the M first embedding vectors; if the sub-transformer layer is the transformer layer closest to the output side among the multiple sub-transformer layers, the data output by the sub-transformer layer is the M first output vectors.
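The serial data flow just described (the first sub-layer consumes the M first embedding vectors, each later sub-layer consumes the M intermediate vectors of its predecessor, and the last sub-layer emits the M first output vectors) can be sketched as follows. Each sub-layer is reduced to a single tanh projection purely for illustration; a real sub-transformer layer would contain attention and feed-forward blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, num_sublayers = 4, 8, 3                    # hypothetical sizes

# One weight matrix per sub-transformer layer (illustrative stand-in).
sublayers = [rng.normal(scale=0.1, size=(d, d)) for _ in range(num_sublayers)]

x = rng.normal(size=(M, d))                      # input of the first sub-layer: M first embedding vectors
for W in sublayers:
    x = np.tanh(x @ W)                           # M intermediate vectors, fed to the next sub-layer

first_output_vectors = x                         # output of the last sub-layer: M first output vectors
assert first_output_vectors.shape == (M, d)      # the number of vectors is preserved throughout
```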
  • the target data further includes a second to-be-predicted data unit, and the prediction sequence of the second to-be-predicted data unit and the first to-be-predicted data unit is determined randomly.
  • the method further includes:
• the fourth embedding vector is used to represent the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data
• the fifth embedding vector is used to represent the third position of the second to-be-predicted data unit in the target data
• the M first embedding vectors and the fourth embedding vector are processed to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit;
• the M+1 second output vectors and the fifth embedding vector are processed to obtain the fourth to-be-predicted data unit;
• the first encoder and the first prediction network are updated based on the difference between the third predicted data unit and the first to-be-predicted data unit to obtain a target encoder and a target prediction network, including:
• the second output vector corresponding to each known data unit is generated according to the M first embedding vectors; the second output vector corresponding to the first to-be-predicted data unit is generated according to the M first embedding vectors and the fourth embedding vector.
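The visibility rule in the bullet above — second output vectors of known units depend only on the M first embedding vectors, while the to-be-predicted unit's output additionally depends on its own fourth embedding vector — amounts to an attention mask. A minimal sketch of such a mask, with M chosen arbitrarily:

```python
import numpy as np

M = 3                         # known data units
N = M + 1                     # plus the first to-be-predicted data unit (fourth embedding vector)

# mask[i, j] is True iff position i may attend to position j.
mask = np.zeros((N, N), dtype=bool)
mask[:, :M] = True            # every position sees the M first embedding vectors
mask[M, M] = True             # the to-be-predicted unit additionally sees its own embedding

assert not mask[:M, M].any()  # known units never see the unit being predicted
assert mask[M].all()          # the predicted unit sees all M known units and itself
```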
  • the target data is text data
  • the known data unit is a known word in the text data
• the first to-be-predicted data unit is a to-be-predicted word in the text data
  • the target data is voice data
  • the known data unit is a known audio sequence in the voice data
  • the first to-be-predicted data unit is a to-be-predicted audio sequence in the voice data
  • the target data is image data
  • the known data unit is a known pixel in the image data
  • the first to-be-predicted data unit is a to-be-predicted pixel in the image data.
• an embodiment of the present application provides an execution device, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to execute the program in the memory, so as to execute the above-mentioned first aspect and any optional method thereof, and the second aspect and any optional method thereof.
  • an embodiment of the present application provides a training device, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to execute a program in the memory, so as to execute the above-mentioned third aspect and any of its optional methods.
• an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when it runs on a computer, it causes the computer to execute the above-mentioned first aspect and any optional method thereof.
• an embodiment of the present application provides a computer program that, when run on a computer, causes the computer to execute the first aspect and any optional method thereof, the second aspect and any optional method thereof, and the third aspect and any optional method thereof.
  • the present application provides a chip system
• the chip system includes a processor for supporting an execution device or a training device to implement the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
• the chip system further includes a memory for storing program instructions and data necessary for the execution device or the training device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
• An embodiment of the present application provides a data processing method, the method including: acquiring M first embedding vectors and a second embedding vector, wherein each first embedding vector is used to represent one known data unit in the target data and the first position of that known data unit in the target data, and the second embedding vector is used to represent the second position of the first to-be-predicted data unit in the target data, the M being a positive integer; processing the M first embedding vectors through the target encoder to obtain M first output vectors corresponding to the M known data units, wherein the first output vector corresponding to each known data unit is generated according to the M first embedding vectors; and processing the M first output vectors and the second embedding vector through the target prediction network to obtain the first to-be-predicted data unit.
• the target encoder can use the M first embedding vectors as inputs; since each first embedding vector includes both the position information and the data information of a known data unit, there is no need to separately set an additional M pieces of position information as inputs of the target encoder.
• the number of latent variables in the intermediate output of the target encoder is also consistent with the number of input embedding vectors, reducing the computation and memory consumption of the target encoder.
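The point above — each first embedding vector already fuses the data information and the position information, so no separate position inputs are needed and the latent count stays at M — can be sketched as an additive embedding lookup. Vocabulary size, dimensions, and ids below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d = 100, 16, 8              # hypothetical sizes

tok_emb = rng.normal(size=(vocab_size, d))       # data-information table
pos_emb = rng.normal(size=(max_len, d))          # position-information table

ids = np.array([5, 17, 42])                      # M = 3 known data units
positions = np.array([0, 1, 3])                  # their first positions in the target data

# Each first embedding vector carries both the data unit and its position,
# so the encoder input (and every intermediate output) is just M vectors.
first_embeddings = tok_emb[ids] + pos_emb[positions]
assert first_embeddings.shape == (len(ids), d)
```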
• FIG. 1 is a schematic structural diagram of an artificial intelligence main framework
  • Figure 2 is a natural language processing system
  • Figure 3a is another natural language processing system
  • Figure 3b is a schematic structural diagram of a system
  • FIG. 4 is a schematic diagram of a related device for natural language processing provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the architecture of a transformer layer
  • FIG. 6a is a schematic diagram of an embodiment of a data processing method provided by an embodiment of the present application.
• FIG. 6b is a schematic diagram of an embodiment of a data processing method
  • FIG. 6c is a schematic diagram of an embodiment of a data processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a neural network model in an embodiment of the application.
  • FIG. 8 is a schematic diagram of the structure of a transformer layer
  • FIG. 9 is a schematic diagram of the operation of an attention head
• FIGS. 10 to 19 are schematic diagrams of embodiments of a data processing method provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
• FIG. 21 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 22 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 23 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • FIG. 24 is a schematic structural diagram of a training device provided by an embodiment of the present application.
  • FIG. 25 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Figure 1 shows a schematic structural diagram of the main frame of artificial intelligence.
• the above-mentioned artificial intelligence theme framework is explained from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.
• the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. Communication with the outside world is achieved through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the basic platform includes distributed computing frameworks and network-related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
• Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution and the productization of intelligent information decision-making to achieve practical applications. Application areas mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart city, etc.
  • This application can be applied to the fields of natural language processing, image processing, and audio and video processing in the field of artificial intelligence.
  • the following will take natural language processing as an example to introduce multiple application scenarios that are applied to products.
  • FIG. 2 shows a natural language processing system, which includes user equipment and data processing equipment.
  • the user equipment includes smart terminals such as mobile phones, personal computers, or information processing centers.
  • the user equipment is the initiator of natural language data processing, and as the initiator of requests such as language question and answer or query, the user usually initiates the request through the user equipment.
  • the above-mentioned data processing device may be a device or server with data processing functions, such as a cloud server, a network server, an application server, and a management server.
• the data processing equipment receives query sentences/voice/text from the intelligent terminal through an interactive interface, and then performs language data processing by means of machine learning, deep learning, search, reasoning, decision-making, etc., using a memory for storing data and a processor for data processing, and feeds back the processing results to the user equipment.
  • the memory in the data processing device may be a general term, including local storage and a database for storing historical data.
  • the database may be on the data processing device or on other network servers.
• the user equipment can receive an instruction from the user; for example, the user equipment can receive a piece of text input by the user and then initiate a request to the data processing device, so that the data processing device executes a natural language processing application (such as natural language generation, text classification, text reasoning, named entity recognition, translation, etc.) on the piece of text obtained by the user equipment, so as to obtain the processing result of the corresponding natural language processing application for this piece of text (such as a predicted word result, a classification result, an inference result, a named entity recognition result, a translation result, etc.).
  • natural language generation can also be called a text prediction task or a natural language synthesis task, which refers to the task of generating missing text or subsequent text in a given piece of text.
• Natural language generation is widely used in scenarios such as search engines and input methods. It can predict the user's next input given some already-entered text, which can greatly improve the efficiency of the user's use of the product.
• the user equipment may receive a piece of text data input by the user (for example, the target data described in the embodiments of the present application), wherein the text data includes known words and words to be predicted, and the words to be predicted are unknown.
• the user equipment can initiate a request to the data processing device (the request carries the text data), so that the data processing device predicts the word to be predicted in the text data, and feeds the obtained word back to the user equipment.
  • the user equipment can receive a piece of text data input by the user, and then initiate a request to the data processing device, so that the data processing device can perform entity classification on the piece of text data, so as to obtain the entity classification result for the piece of text data, and The entity classification result is fed back to the user equipment;
• the user equipment may receive a piece of text data (the text data being Chinese text) input by the user, and then initiate a request to the data processing device, so that the data processing device translates the piece of text data into English, so as to obtain the English translation for the piece of text data and feed the English translation back to the user device.
  • the data processing device may process the above text data by using the data processing method of the embodiment of the present application.
  • Figure 3a shows another natural language processing system.
  • the user equipment is directly used as a data processing device.
  • the user equipment can directly receive input from the user and process it directly by the hardware of the user equipment itself.
• the specific process is similar to that in FIG. 2; reference may be made to the above description, which will not be repeated here.
  • FIG. 4 is a schematic diagram of a related device 300 for natural language processing provided by an embodiment of the present application.
• the user equipment in FIG. 2 and FIG. 3a may be the local device 301 or the local device 302 in FIG. 4, and the data processing device in FIG. 2 may be the execution device 310 in FIG. 4; the data storage system 350 may store the data to be processed by the execution device 310, and may be integrated on the execution device 310, or may be set on the cloud or other network servers.
  • the processors in FIG. 2 and FIG. 3a may perform data training/machine learning/deep learning through a neural network model or other models, and use the data to finally train or learn a model (for example, the target encoder, target prediction network, task network, etc.) to perform natural language processing applications (such as natural language generation, text classification, sequence tagging, reading comprehension, text generation, text reasoning, translation) on text data (such as the target data described in the embodiments of this application) etc.) to obtain corresponding processing results (for example, the first data unit to be predicted, the second data unit to be predicted, and the task processing result in the embodiment of the present application, etc.).
  • the embodiments of the present application can also be applied in the field of image processing and audio and video processing, and the above-mentioned data processing device processes the target data by using the data processing methods of the embodiments of the present application.
  • the above data processing device may also be referred to as a data processing apparatus, an execution device, a server, a terminal device, and the like in subsequent embodiments.
  • FIG. 3b is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the system architecture 500 includes an execution device 510 , a training device 520 , a database 530 , a client device 540 , a data storage system 550 , and a data acquisition system 560 .
  • the execution device 510 includes a calculation module 511 , an I/O interface 512 , a preprocessing module 513 and a preprocessing module 514 .
  • the calculation module 511 may include the target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
  • the data collection device 560 is used to collect training data.
• In the task of natural language synthesis, the training data can be text data with missing text and the complete text data corresponding to it; in the task of audio synthesis, the training data can be speech data with missing audio sequences and the complete speech data corresponding to it; in the task of image synthesis (also called image reconstruction), the training data can be image data or video data with missing pixels and the complete image data or video data corresponding to it.
  • the data collection device 560 stores the training data in the database 530 , and the training device 520 obtains the target model/rule 501 by training based on the training data maintained in the database 530 .
• the target model/rule 501 for realizing the natural language synthesis task can be used to realize the natural language synthesis task, that is, text data with missing text is input into the target model/rule 501, and the missing text (for example, the first to-be-predicted data unit and the second to-be-predicted data unit in the embodiments of the present application) can be obtained.
• Taking the target model/rule 501 for implementing target processing tasks (such as short text classification, long text classification, natural language inference, text similarity matching, text sentiment classification, etc.) as an example, the above target model/rule 501 (such as the target encoder and task network in the embodiments of the present application) can be used to realize the target processing task, that is, the target data is input into the target model/rule 501, and the task processing result can then be obtained.
  • the training data maintained in the database 530 does not necessarily come from the collection of the data collection device 560, and may also be received from other devices.
• the training device 520 may not necessarily train the target model/rule 501 entirely based on the training data maintained by the database 530, and may also obtain training data from the cloud or elsewhere for model training; the above description should not be construed as a limitation on the embodiments of the present application.
• the target model/rule 501 trained according to the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 3b; the execution device 510 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR)/virtual reality (VR) device, an in-vehicle terminal, etc., and can also be a server or the cloud.
• the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices, and the user can input data to the I/O interface 512 through the client device 540 (for example, the target data in the embodiments of this application).
• the preprocessing module 513 and the preprocessing module 514 are used for preprocessing according to the input data received by the I/O interface 512 (for example, obtaining the positions of the known data units and the data unit to be predicted in the target data, or generating attention information, etc.). It should be understood that there may be no preprocessing module 513 and preprocessing module 514, or only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 do not exist, the calculation module 511 can be directly used to process the input data.
• When the execution device 510 preprocesses the input data, or the calculation module 511 of the execution device 510 performs calculation and other related processing, the execution device 510 can call data, code, etc. in the data storage system 550 for the corresponding processing; the data and instructions obtained by the corresponding processing may also be stored in the data storage system 550.
• the I/O interface 512 presents the processing results, such as the missing text, missing audio sequences, or missing pixels obtained after processing (for example, the first to-be-predicted data unit, the second to-be-predicted data unit, or the task processing result in the embodiments of the present application), to the client device 540 for provision to the user.
  • the user can manually give input data, and the “manually give input data” can be operated through the interface provided by the I/O interface 512 .
  • the client device 540 can automatically send the input data to the I/O interface 512. If the user's authorization is required to request the client device 540 to automatically send the input data, the user can set the corresponding permission in the client device 540.
  • the user can view the result output by the execution device 510 on the client device 540, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 540 can also be used as a data collection terminal to collect the input data of the input I/O interface 512 and the output result of the output I/O interface 512 as new sample data, and store them in the database 530 .
• the I/O interface 512 directly stores, as new sample data, the input data input into the I/O interface 512 and the output result of the I/O interface 512 as shown in the figure into the database 530.
  • FIG. 3b is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
• the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
  • execution device 510 may also be deployed in the client device 540 .
• the above-mentioned data storage system 550 may store relevant code for implementing the data processing method in the embodiments of the present application, and the computing module 511 may obtain, from the data storage system 550, the code related to the data processing method in the embodiments of the present application, so as to execute the data processing method in the embodiments of the present application.
• the computing module 511 may include a hardware circuit (such as an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, etc.), or a combination of these hardware circuits. For example, the computing module 511 can be a hardware system with the function of executing instructions, such as a CPU or DSP, or a hardware system without the function of executing instructions, such as an ASIC or FPGA, or a combination of the above hardware system without the function of executing instructions and the hardware system with the function of executing instructions.
• the computing module 511 may be a hardware system with the function of executing instructions; the data processing method provided by the embodiments of the present application may be software code stored in the data storage system 550, and the computing module 511 may obtain the software code from the data storage system 550 and execute the obtained software code to implement the data processing method provided by the embodiments of the present application.
• the calculation module 511 may be a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions; some steps of the data processing method provided by the embodiments of the present application may also be implemented by the hardware system without the function of executing instructions in the calculation module 511, or by the preprocessing module 513 and the preprocessing module 514, which is not limited here.
• a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes xs (i.e., input data) and an intercept 1 as input, and the output of the operation unit can be: h = f(Σ_{s=1}^{n} W_s · x_s + b)
• where s = 1, 2, ..., n, n is a natural number greater than 1
  • Ws is the weight of xs
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
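The neural unit described above (a weighted sum of the inputs plus a bias, passed through an activation function such as sigmoid) can be written out directly; the input values below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b, f=sigmoid):
    # output = f(sum_s W_s * x_s + b)
    return f(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_s
W = np.array([0.2, 0.4, 0.1])    # weights W_s of the x_s
b = 0.1                          # bias of the neural unit
y = neural_unit(x, W, b)         # W·x + b = -0.1 + 0.1 = 0, so y = sigmoid(0) = 0.5
assert np.isclose(y, 0.5)
```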
  • a neural network is a network formed by connecting a plurality of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • FIG. 5 is a schematic diagram of the architecture of a transformer layer.
• the neural network includes an embedding layer and at least one transformer layer; the at least one transformer layer can be N transformer layers (N is an integer greater than 0), wherein each transformer layer includes successively adjacent attention layers, summation and normalization (add & norm) layers, feed-forward layers, and summation and normalization layers.
• In the embedding layer, the current input is embedded to obtain multiple embedding vectors. In the attention layer, P input vectors are obtained from the layer above the first transformer layer; taking any first input vector among the P input vectors as the center, an intermediate vector corresponding to the first input vector is obtained based on the degree of association between each input vector within a preset attention window and the first input vector, and the P intermediate vectors corresponding to the P input vectors are determined in this way. In the pooling layer, the P intermediate vectors are combined into Q output vectors, wherein the multiple output vectors obtained by the last transformer layer are used as the feature representation of the current input.
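The layer structure just described (attention, then add & norm, then feed-forward, then add & norm again) can be sketched compactly. This is a generic single-head illustration with arbitrarily chosen shapes, not the specific layer of FIG. 5:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_layer(x, Wq, Wk, Wv, W1, W2):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v  # attention layer
    x = layer_norm(x + attn)                            # add & norm
    ff = np.maximum(0.0, x @ W1) @ W2                   # feed-forward layer
    return layer_norm(x + ff)                           # add & norm

rng = np.random.default_rng(0)
P, d, d_ff = 5, 8, 16                                   # P input vectors of width d
x = rng.normal(size=(P, d))
out = transformer_layer(
    x,
    rng.normal(scale=0.1, size=(d, d)),    # Wq
    rng.normal(scale=0.1, size=(d, d)),    # Wk
    rng.normal(scale=0.1, size=(d, d)),    # Wv
    rng.normal(scale=0.1, size=(d, d_ff)), # W1
    rng.normal(scale=0.1, size=(d_ff, d))) # W2
assert out.shape == (P, d)                 # one output vector per input vector
```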
• the attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience with external sensation to increase the fineness of observation in some areas; it can use limited attention resources to quickly screen out high-value information from a large amount of information.
  • Attention mechanism can quickly extract important features from sparse data, so it is widely used in natural language processing tasks, especially machine translation.
  • the self-attention mechanism is an improvement of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features.
• the essential idea of the attention mechanism can be expressed by the following formula: Attention(Query, Source) = Σ_{i=1}^{L_x} Similarity(Query, Key_i) · Value_i
• where L_x represents the length of Source.
  • the meaning of the formula is as follows: imagine that the constituent elements of Source are composed of a series of <Key, Value> data pairs. Given an element Query in the target Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and each Key, and a weighted sum of the Values then gives the final Attention value. So in essence, the Attention mechanism performs a weighted sum over the Value values of the elements in Source, with Query and Key used to calculate the weight coefficient of the corresponding Value.
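As a hedged illustration of this weighted-sum formulation, the following numpy sketch computes one Attention value; the dot-product similarity and the toy Query/Key/Value vectors are assumptions for the example, not part of the original text.

```python
import numpy as np

def attention_value(query, keys, values):
    """query: (d,); keys: (Lx, d); values: (Lx, dv)."""
    scores = keys @ query                    # Similarity(Query, Key_i) via dot product
    weights = np.exp(scores - scores.max())  # softmax turns similarities into weights
    weights /= weights.sum()
    return weights @ values                  # weighted sum of the Values

# Toy Source of Lx = 3 <Key, Value> pairs; the Query is closest to the first Key.
query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[1.0], [2.0], [3.0]])
out = attention_value(query, keys, values)
```

Because the Query is most similar to the first Key, the first Value dominates the weighted sum.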
  • Attention can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on these important information, ignoring most of the unimportant information.
  • the process of focusing is reflected in the calculation of the weight coefficient.
  • the self-attention mechanism can be understood as internal Attention (intra attention).
  • the Attention mechanism occurs between the element Query of the Target and all elements in the Source.
  • the self-attention mechanism refers to attention occurring between the internal elements of the Source or between the internal elements of the Target.
  • NLP (natural language processing): natural language is human language, and natural language processing is the processing of human language. Natural language processing is the process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient manner.
  • typical NLP tasks include: named entity recognition (NER), relation extraction (RE), information extraction (IE), sentiment analysis, speech recognition, question answering, topic segmentation, etc.
  • a pretrained language model is a natural language sequence encoder that encodes each word in a natural language sequence into a vector representation for prediction tasks. Its training consists of two stages. In the pre-training stage, the model is trained on large-scale unsupervised text on language model tasks to learn word representations. In the fine-tuning stage, the model is initialized with the parameters learned in the pre-training stage and trained for fewer steps on downstream tasks such as text classification and sequence labeling, so that the semantic information obtained by pre-training can be successfully transferred to downstream tasks.
  • an autoregressive language model refers to a model that can predict the next possible word (such as "good") based on a given context (such as "this mobile phone is very"). It is usually given the context on the left and predicts the word to its right, but it can also predict a word in the middle given the context on both the left and the right.
  • FIG. 6a is a schematic diagram of an embodiment of a data processing method provided by an embodiment of the present application.
  • a data processing method provided by an embodiment of the present application can be applied to the data processing device and execution device described above. Specifically, the data processing method can be applied to terminal devices such as mobile phones, tablets, notebook computers, and smart wearable devices, or applied to a server on the cloud side.
  • the data processing method provided by an embodiment of the present application includes:
  • each first embedding vector is used to represent one known data unit in the target data and the first position of that known data unit in the target data;
  • the second embedding vector is used to represent the second position, in the target data, of the first data unit to be predicted in the target data; the M is a positive integer.
  • the target data is data with missing data, wherein the target data includes non-missing data (referred to as known data units in the embodiments of the present application) and missing data (referred to as data units to be predicted in the embodiments of the present application, such as the first to-be-predicted data unit and the second to-be-predicted data unit).
  • the known data unit is a data unit in the data that is not missing.
  • the target data may be text data
  • the known data unit in the target data may be a known word or a known character in the text data.
  • the to-be-predicted data unit may be a to-be-predicted word or a to-be-predicted character in the text data.
  • the target data may be speech data
  • the known data unit in the target data may be a known audio sequence in the speech data
  • the data unit to be predicted may be an audio sequence to be predicted in the speech data
  • the target data may be image data
  • the known data units in the target data may be known pixels in the image data
  • the data units to be predicted may be pixels to be predicted in the image data.
  • the data granularity of the known data unit and the data unit to be predicted is related to the type of target data;
  • the data granularity of the known data unit and the data unit to be predicted may be the smallest data unit in the target data, or a unit composed of a plurality of the smallest data units; the granularity of known data units and to-be-predicted data units is not limited here.
  • the target data may include M known data units and at least one to-be-predicted data unit (including the first to-be-predicted data unit), wherein the to-be-predicted data unit is invisible in the target data data, the to-be-predicted data unit needs to be determined through M known data units.
  • the text data may include M known words and at least one word to be predicted (including the first word to be predicted).
  • the text data may be Chinese text, English text, or other language text, and the text data may be sentences, paragraphs, chapters, and the like.
  • the target data can be "_ _ sat on the mat", where "sat", "on", "the" and "mat" are known data units, and the two "_" are not visible in the target data and are the data units to be predicted; it should be understood that the meaning of the symbol "_" here is empty, rather than an underscore.
  • M first embedding vectors may be obtained, wherein each first embedding vector is used to represent one known data unit in the target data and the first position of that known data unit in the target data.
  • the M known data units in the target data may be embedded through the embedding layer to obtain M third embedding vectors.
  • the embedding layer can be called the input embedding layer.
  • the current input can be M known data units.
  • the embedding layer can perform embedding processing on each known data unit in the current input, and can obtain the embedding vector (ie, the third embedding vector) corresponding to each known data unit.
  • a position vector of each known data unit in the M known data units may also be obtained, where the position vector is used to indicate the first position; the first position is used to indicate the position of the known data unit in the target data. Specifically, the first position is used to indicate the relative positional relationship between the known data unit and the other known data units, and between the known data unit and the first to-be-predicted data unit.
  • the embedding layer may include an input embedding layer and a positional encoding layer.
  • word embedding processing can be performed on each known data unit in the current input, so as to obtain the third embedding vector of each known data unit.
  • in the position coding layer, the position of each known data unit in the current input can be obtained, and then a position vector can be generated for the position of each known data unit.
  • the first position of each known data unit in the target data may be an absolute position of each known data unit in the target data. Taking the current input "a few numbers should be paid back" as an example, the position of "a few" can be expressed as the first position, the position of "number" as the second position, and so on. In some examples, the first position of each known data unit in the target data may be a relative position of each known data unit in the target data. Still taking the current input "a few numbers should be paid back" as an example, the position of "a few" can be expressed as before "number", and the position of "number" can be expressed as after "a few" and before "should", and so on.
  • the position vector of each known data unit and the corresponding third embedding vector can be fused to obtain the first embedding vector of each known data unit, that is, to obtain a plurality of first embedding vectors corresponding to the current input.
  • the method of fusion may be to perform an addition operation on the third embedding vector and the position vector, or to use other operations, as long as the resulting first embedding vector can carry one known data unit in the target data and the first position of that known data unit in the target data; the specific fusion manner is not limited here.
  • the plurality of first embedding vectors may be represented as embedding matrices having preset dimensions.
  • the number of the plurality of first embedding vectors may be set to be M, and the preset dimension may be H dimension, then the plurality of first embedding vectors may be represented as M ⁇ H embedding matrices.
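A minimal sketch of forming the M×H embedding matrix by fusing token (third) embeddings with position vectors through addition; the table sizes, unit ids, and positions below are invented for illustration.

```python
import numpy as np

M, H, vocab_size, max_len = 4, 8, 100, 16
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, H))  # input-embedding table (third embedding vectors)
pos_table = rng.normal(size=(max_len, H))       # positional-encoding table (position vectors)

known_units = [5, 17, 42, 7]   # ids of the M known data units (made-up)
positions = [2, 3, 4, 5]       # their first positions in the target data

third_embeddings = token_table[known_units]             # (M, H)
position_vectors = pos_table[positions]                 # (M, H)
first_embeddings = third_embeddings + position_vectors  # fusion by addition -> M x H matrix
```

With M = 4 and H = 8, `first_embeddings` is exactly the M×H embedding matrix described above.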
  • a second embedding vector may be obtained, where the second embedding vector is used to represent the second position, in the target data, of the first to-be-predicted data unit in the target data; the second position may be used to indicate the relative positional relationship between the first data unit to be predicted and each known data unit in the target data.
  • an embedding process may be performed on the second position of the first to-be-predicted data unit in the target data through an embedding layer, so as to obtain a second embedding vector representing the second position, in the target data, of the first to-be-predicted data unit; the second embedding vector can be used as the input of the subsequent target prediction network.
  • the second position is used to indicate the relative positional relationship between the first to-be-predicted data unit and each known data unit in the target data; this is similar to the description of the first position, and details are not repeated here.
  • M first embedding vectors for the M known data units and second embedding vectors for the first to-be-predicted data unit may be obtained.
  • the target encoder can process the M first embedding vectors to obtain M first output vectors corresponding to the M known data units, that is, to obtain one first output vector corresponding to each known data unit.
  • the auto-encoding model and the auto-regressive language model are integrated. Compared with the auto-encoding model and the auto-regressive language model, this model doubles the number of hidden states.
  • the white part corresponds to the auto-encoding model
  • the gray part corresponds to the auto-regressive language model
  • the latent variables related to the auto-encoding model are used to represent location information
  • the auto-regressive model is used to provide the context information predicted by the auto-encoding language model.
  • its memory consumption is therefore twice that of the auto-encoding and auto-regressive models.
  • in this embodiment of the present application, by contrast, the number of hidden states is the same as the number of hidden states in the auto-encoding and auto-regressive language models.
  • the target encoder can take the M first embedding vectors as input, where each first embedding vector includes the position information and the data information of a known data unit, so there is no need to separately set additional M pieces of position information as the input of the target encoder.
  • the number of latent variables in the intermediate output of the target encoder is also consistent with the number of input embedding vectors, which reduces the computational complexity and memory consumption of the target encoder.
  • the input of the target encoder is M first embedding vectors
  • the output is M first output vectors.
  • each first output vector is obtained based on M first embedding vectors.
  • that each first output vector is obtained based on the M first embedding vectors can be understood as follows: each first output vector can use the M first embedding vectors as a reference, that is, when generating each first output vector, each first embedding vector is visible, or each first output vector has a dependency relationship with the M first embedding vectors.
  • the target encoder may be the first transformer layer, and each first output vector being obtained based on the M first embedding vectors can be understood as there being an attention association between any two first embedding vectors among the M first embedding vectors.
  • the first transformer layer may include multiple sub-transformer layers connected in series, the multiple sub-transformer layers including adjacent first and second sub-transformer layers; that is, the first sub-transformer layer and the second sub-transformer layer may be any two adjacent sub-transformer layers in the first transformer layer.
  • the data output by the previous adjacent sub-transformer layer can be processed through each of the sub-transformer layers to obtain M intermediate vectors, and the M intermediate vectors are output to the next adjacent sub-transformer layer; if the sub-transformer layer is the layer closest to the input side among the multiple sub-transformer layers, the input data of the sub-transformer layer is the M first embedding vectors; if the sub-transformer layer is the layer closest to the output side among the multiple sub-transformer layers, the data output by the sub-transformer layer is the M first output vectors.
  • the input of each sub-transformer layer includes M feature vectors corresponding to the M known data units
  • the output of each sub-transformer layer includes M output vectors corresponding to M known data units.
  • the number of latent variables in the intermediate output of the target encoder is also consistent with the number of input embedding vectors, which reduces the computational complexity and memory consumption of the target encoder.
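The serial sub-transformer stack described above can be sketched as follows; the per-layer transform here is a deliberately simplified stand-in (a fixed linear map) for a real sub-transformer layer, and all sizes are illustrative.

```python
import numpy as np

def make_sublayer(H, seed):
    # Stand-in for one sub-transformer layer: a fixed linear map. A real
    # sub-layer would apply attention plus a feed-forward network instead.
    W = np.random.default_rng(seed).normal(size=(H, H)) / np.sqrt(H)
    return lambda X: X @ W

M, H, num_sublayers = 4, 8, 3
sublayers = [make_sublayer(H, s) for s in range(num_sublayers)]

X = np.random.default_rng(42).normal(size=(M, H))  # M first embedding vectors (input side)
for sublayer in sublayers:
    X = sublayer(X)            # each sub-layer emits M intermediate vectors
first_output_vectors = X       # output of the layer closest to the output side
```

Note that every sub-layer consumes and produces exactly M vectors, matching the point above that the number of latent variables stays equal to the number of input embedding vectors.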
  • the core feature of the transformer layer is the unique attention mechanism it adopts.
  • the transformer model uses this attention mechanism to assign different attention coefficients to the embedding vector of each word in the sentence, so as to more comprehensively consider the influence of the context in the sentence on each word.
  • the specific transformer layer can include successively adjacent multi-head attention layers, add and normalization (add & norm) layers, feed forward (feed forward) layers, and add and normalize layers.
  • the attention layer is connected to the embedding layer and obtains M embedding vectors from the embedding layer as input vectors; based on the correlation between the embedding vectors, the vectors are synthesized to obtain M output vectors, which are output to the subsequent transformer layer.
  • each subsequent transformer layer takes the output of the previous layer as its input vectors and performs operations similar to those of the previous transformer layer.
  • FIG. 8 is a schematic diagram of the structure of a transformer layer.
  • Each sub-transformer layer in this embodiment of the present application may refer to the structure shown in FIG. 8.
  • the transformer layer includes a successively adjacent multi-head attention layer, addition and normalization (add & norm) layer, feed forward layer, and addition and normalization layer.
  • the multi-head attention layer obtains M input vectors Xl from the upper layer, which can also be expressed as a matrix X, and uses the self-attention mechanism to transform each vector based on the correlation between the vectors to obtain M output vectors, which can also be represented as a matrix Y.
  • when the multi-head attention layer is a layer directly connected to the embedding layer (such as the transformer layer directly connected to the embedding layer in Figure 7), the input vectors it obtains are the embedding vectors output by the embedding layer; when the multi-head attention layer is the multi-head attention layer included in a subsequent transformer layer (such as the multi-head attention layer included in the transformer layer directly connected to the previous transformer layer in Figure 7), the input vectors it obtains are the output vectors of the previous transformer layer.
  • the multi-head attention (MHA) based MHA layer includes multiple attention heads (Head 1, Head 2, ..., Head N as shown in Fig. 8).
  • FIG. 9 is a schematic diagram of the operation of an attention head, which shows how the attention head transforms an input matrix X into an output matrix Y.
  • in FIG. 9, the first transformation matrix Q, the second transformation matrix K and the third transformation matrix V are respectively used to transform each input vector Xi among the M input vectors <X1, X2, ..., XM> to obtain, for each input vector, a corresponding first intermediate vector (q vector), second intermediate vector (k vector) and third intermediate vector (v vector).
  • in operation, the first transformation matrix Q, the second transformation matrix K and the third transformation matrix V can be used to linearly transform the input matrix X composed of the M input vectors to obtain the Q matrix, K matrix and V matrix of the input matrix, respectively.
  • the dot product of qi and kj can be directly determined as the degree of correlation; more classically, the dot product result is divided by a constant, and then a softmax operation is performed, with the operation result used as the degree of correlation between the input vectors Xi and Xj, that is: αi,j = softmax(qi · kj / √d), where d is the dimension of the intermediate vectors.
  • then, each degree of correlation αi,j between the i-th input vector Xi and each input vector Xj can be used as a weighting factor, and the third intermediate vectors (v vectors, vj) corresponding to each input vector Xj can be weighted and combined to obtain the i-th combined vector Ci;
  • in this way, a vector sequence <C1, C2, ..., CM> of M combined vectors corresponding to the M input vectors, or a matrix C, can be obtained.
  • M output vectors can be obtained.
  • the output matrix Y is the combined vector matrix C, which can be written as: Y = C = softmax(Q · K^T / √d) · V.
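The per-head computation just described can be sketched in numpy as follows; the matrix sizes and random transformation matrices are illustrative assumptions, not the application's trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # q, k and v intermediate vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # dot products divided by a constant
    alpha = softmax(scores)                  # correlation degrees; each row sums to 1
    return alpha @ V                         # combined vectors C (the output Y)

M, H, d = 4, 8, 4
rng = np.random.default_rng(1)
X = rng.normal(size=(M, H))                  # M input vectors
Y = attention_head(X, rng.normal(size=(H, d)),
                   rng.normal(size=(H, d)), rng.normal(size=(H, d)))
```

Each of the M rows of `Y` is the weighted combination of the v vectors, with weights given by that row's correlation degrees.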
  • in more detail, the MHA layer maintains m sets of transformation matrices, each set including the aforementioned first transformation matrix Q, second transformation matrix K and third transformation matrix V, so that the above operations can be performed in parallel to obtain m combined vector sequences (i.e., m matrices C), each vector sequence including M combined vectors obtained based on one set of transformation matrices.
  • the MHA layer splices the obtained m combined vector sequences to obtain a splicing matrix, and then transforms the splicing matrix through a fourth transformation matrix W to obtain the final output matrix Y.
  • in summary, the MHA layer performs a transformation operation based on the correlation between the M input vectors and obtains M output vectors.
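A minimal numpy sketch of this multi-head combination, assuming made-up sizes and random transformation matrices: m heads each produce a combined-vector matrix C, the matrices are spliced, and the fourth transformation matrix W maps the splice to the final output Y.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

M, H, d, m = 4, 8, 4, 2                  # m attention heads
rng = np.random.default_rng(2)
X = rng.normal(size=(M, H))              # M input vectors
C_list = [head(X, rng.normal(size=(H, d)), rng.normal(size=(H, d)),
               rng.normal(size=(H, d))) for _ in range(m)]
splice = np.concatenate(C_list, axis=-1)  # splicing matrix, shape (M, m*d)
W = rng.normal(size=(m * d, H))           # fourth transformation matrix
Y = splice @ W                            # final output matrix, shape (M, H)
```

Choosing m·d = H keeps the output dimension equal to the input dimension, which is the usual convention.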
  • the transformer layer may include a feedforward layer, where the feedforward layer includes an input layer, an intermediate layer, and an output layer.
  • neural network models can contain multiple transformer layers.
  • the above-mentioned multiple transformer layers may be stacked and connected in a residual network manner.
  • the target encoder includes attention heads, and since the known data units are mutually visible in the target data, when the M first embedding vectors are processed, there is an attention association between any two first embedding vectors among the M first embedding vectors. Specifically, attention information can be obtained, where the attention information is used to indicate that, when the attention head processes the M first embedding vectors, there is an attention association between any two first embedding vectors among the M first embedding vectors; then, according to the attention information, the M first embedding vectors can be processed through the target encoder so that each output vector has a dependency relationship with the M first embedding vectors.
  • the M output vectors can be input into the target prediction network, and the M first output vectors and the second embedded vectors are processed through the target prediction network to obtain The first to-be-predicted data unit is obtained.
  • the target prediction network may be a transformer layer.
  • the target prediction network can use the M first output vectors and the second embedding vectors as inputs to obtain the vector representation of the first data unit to be predicted.
  • then, based on that vector representation, a classifier (for example, a support vector machine, a softmax classifier, a K-nearest neighbor algorithm, etc.) can be used to recover the first data unit to be predicted.
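As a toy illustration of recovering a data unit with a softmax classifier over a vocabulary: the vocabulary, the identity classifier weights, and the output vector below are made-up stand-ins, not the application's trained network.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary of data units
W_cls = np.eye(len(vocab))                  # toy classifier weights (one column per unit)

# Toy vector representation of the first data unit to be predicted, as it
# might come out of the target prediction network; its largest component
# points at "cat".
h_pred = np.array([0.1, 2.0, 0.0, 0.3, 0.1])

logits = h_pred @ W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax over the vocabulary
predicted_unit = vocab[int(np.argmax(probs))]
```

The unit with the highest probability is taken as the recovered data unit, here "cat".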
  • that is, the first word to be predicted can see its corresponding position vector (the second embedding vector) and each known word (the first embedding vectors), and the target prediction network can then use the M first output vectors and the second embedding vector as inputs to obtain a word vector representation of the first word to be predicted.
  • for example, suppose the words from position 3 to position 6 in the target data are "sat on the mat", and the goal is to predict the first two words in the sentence. The target prediction network can first determine, based on the 4 input vectors corresponding to "sat on the mat" and prediction position 1, that the first word to be predicted is "the". Similarly, the target prediction network next predicts the word at position 2 based on "the _ sat on the mat".
  • the target data further includes a second data unit to be predicted; before the M first embedding vectors are processed by the target encoder, the first data to be predicted may be randomly determined The predicted sequence of the unit and the second to-be-predicted data unit.
  • furthermore, a fourth embedding vector and a fifth embedding vector can be obtained, where the fourth embedding vector is used to indicate the second position of the first to-be-predicted data unit in the target data, and the fifth embedding vector is used to represent the third position, in the target data, of the second to-be-predicted data unit; through the target encoder, the M first embedding vectors and the fourth embedding vector are processed to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit.
  • through the target prediction network, the M+1 second output vectors and the fifth embedding vector are processed to obtain the second to-be-predicted data unit.
  • the second output vector corresponding to each known data unit is generated according to the M first embedding vectors; the second output vector corresponding to the first data unit to be predicted is generated according to the M first embedding vectors An embedding vector and the fourth embedding vector are generated.
  • the preprocessing module can process the input word vector sequence, and input the processing results into the autoregressive word vector encoding module and the query module, and the output results of the autoregressive word vector encoding module and the query module can be input into the prediction module.
  • the prediction module can output predicted labels.
  • the autoregressive word vector encoding module can be the target encoder in the above embodiment, the query module is used to generate the second embedded vector and the fifth embedded vector, the prediction module can be the target prediction network in the above embodiment, and the predicted label can be is the data unit to be predicted in the above embodiment.
  • the preprocessing module can perform the operations of sequence rearrangement, segmentation and information extraction on the input word vector sequence.
  • Sequence rearrangement is to reconstruct the order of the input word vectors.
  • the reconstruction methods include but are not limited to: keeping the original order unchanged, random order, and reverse order.
  • in sequence blocking, the rearranged sentence is divided into two blocks; in the subsequently constructed model, the information of each word in the first block is visible to all words, and the information of each word in the second block is visible only to the words positioned after it in the rearranged order.
  • the information extraction module extracts three parts of information on the basis of sequence rearrangement: the rearranged word vector sequence; the attention matrix (this matrix defines the dependencies between the word vectors in the autoregressive word vector encoding module, and can be the attention information in the above embodiment); and the auxiliary information of the labels to be predicted (that is, the positions in the above embodiment; the auxiliary information defines the position information of the words to be predicted in the original sentence).
  • the first two parts of information will be output to the autoregressive word vector encoding module, and the third part will be output to the query module. It should be understood that the operation of the preprocessing module is manually defined and does not contain any learnable part.
  • the autoregressive word vector encoding module can learn the context information corresponding to each word, and finally learn a word vector sequence (that is, the output vector in the above embodiment) containing its context information for each word in the sentence.
  • the autoregressive word vector encoding module can be shown in the left figure of Fig. 12.
  • the module includes several layers of autoregressive word vector encoders (each layer can be a sub-transformer layer in the above embodiment); each layer receives the word vectors output by the previous layer, calculates the dependencies between the word vectors, and integrates the context information of each word vector into the output word vectors.
  • the right picture of Figure 12 shows the calculation process of the i-th layer of the autoregressive word vector encoder.
  • Each box in the figure represents a word vector, the lower line represents the word vector input by this layer, and the upper line represents the word vector output by this layer.
  • each arrow ai represents the dependence of each output word vector on the input word vector, and whether the dependence exists can be determined by the attention matrix.
  • the word vector sequence containing its context information and the query information can be input into the prediction module, and the prediction module will be based on the word vector sequence containing its context information and Query information to make predictions and get prediction labels.
  • Figure 14 shows an example of a random prediction order.
  • the original sentence is "the cat sat on the mat”
  • the preprocessing module randomly rearranges the original sentence
  • the rearranged sequence is "the on mat sat the cat”
  • the rearranged sentence is divided into blocks: the first block is "the on mat sat", and any word in the first block can be seen by all other words in the sentence; the second block is "the cat", and any word in the second block can be seen only by the words after it (in the rearranged sequence).
  • the words of the second block will be predicted in this example, and since the original sentence is randomly rearranged, the prediction order of the words of the second block is random.
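Under the visibility rules of this example, the attention matrix can be built as in the following sketch. Whether a second-block position sees itself is an assumption made here for illustration; the text above does not fix that choice.

```python
import numpy as np

rearranged = ["the", "on", "mat", "sat", "the", "cat"]  # "the cat sat on the mat", rearranged
split = 4                                               # first block: positions 0..3

n = len(rearranged)
A = np.zeros((n, n), dtype=int)  # A[i, j] = 1 means position i may attend to position j
for i in range(n):
    for j in range(n):
        # First-block words are visible to all words; a second-block word is
        # visible only to itself and the words after it in the rearranged order.
        if j < split or j <= i:
            A[i, j] = 1
```

In the resulting matrix, every position sees the whole first block, while "the" at rearranged position 4 cannot see "cat" at position 5, and "cat" can see the "the" before it.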
  • the module obtains the attention matrix according to the sentence block.
  • this module outputs the rearranged word vector sequence (the M first embedding vectors) and the attention matrix (that is, the attention information in the above embodiment) to the autoregressive word vector encoding module (that is, the target encoder in the above embodiment), and outputs the auxiliary information of the labels to be predicted (that is, the second position indicated by the second embedding vector and the third position indicated by the fifth embedding vector in the above embodiment; in the example of FIG. 14, the auxiliary information is position 1 and position 2, indicating that the model will predict the words at position 1 and position 2 in the original sequence, "the" and "cat" respectively) to the query module, and the query module can generate the second embedding vector and the fifth embedding vector.
  • a random sequence is used for prediction, and the sequence information of the data units to be predicted is fully utilized, and the sequence information is explicitly incorporated into the output vector.
  • the method for predicting a word to be predicted is described above by taking text data as an example, and the data processing method in the embodiment of the present application may also be applied to the field of computer vision or speech.
  • the above-mentioned target text can be replaced with a sequence of pictures or speech; the sequence correspondingly undergoes operations such as reordering and blocking by the preprocessing module to obtain the rearranged vector sequence of picture or speech units and the position information of the positions to be predicted, which enter the autoregressive encoding module and the query module, and finally the prediction module obtains the picture or speech unit corresponding to the position to be predicted.
  • This embodiment of the present application may also be presented in the form of a service or software on the cloud side.
  • the service or software may have a function of obtaining a data unit to be predicted based on a known data unit in the target data.
  • the service includes, but is not limited to, prediction and recovery of content at any position in text data (sentences, paragraphs, chapters, etc.), recovery of blurred speech or missing audio sequences in speech data, and recovery of blurred or broken pixels in picture or video data, etc.
  • An embodiment of the present application provides a data processing method, the method includes: acquiring M first embedding vectors and second embedding vectors; wherein each first embedding vector is used to represent a known data in the target data unit and the first position of the one known data unit in the target data, and the second embedding vector is used to represent the second position of the first data unit to be predicted in the target data in the target data position; the M is a positive integer; the M first embedded vectors are processed by the target encoder to obtain M first output vectors corresponding to the M known data units, wherein each of the It is known that the first output vector corresponding to the data unit is generated according to the M first embedding vectors; through the target prediction network, the M first output vectors and the second embedding vectors are processed to obtain the The first data unit to be predicted.
• The target encoder can take the M first embedding vectors as inputs. Because each first embedding vector already carries both the position information and the data information of a known data unit, there is no need to separately feed an additional M pieces of position information into the target encoder.
• Accordingly, the number of latent variables in the intermediate output of the target encoder matches the number of input embedding vectors, reducing the computation and memory consumption of the target encoder.
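• The fusion of data and position information described above can be sketched as follows (a minimal illustration with hypothetical shapes and table names, not the patented implementation): each first embedding vector is the sum of a data-unit embedding and a position embedding, so the encoder receives exactly M vectors and no separate position stream.

```python
import numpy as np

def build_input_embeddings(token_ids, positions, tok_table, pos_table):
    """Each first embedding vector = data-unit embedding + position embedding,
    so no separate stream of M position vectors is fed to the encoder."""
    return tok_table[token_ids] + pos_table[positions]

rng = np.random.default_rng(0)
tok_table = rng.normal(size=(100, 16))   # vocabulary of 100 data units, dim 16
pos_table = rng.normal(size=(64, 16))    # up to 64 positions

token_ids = np.array([5, 9, 2])          # M = 3 known data units
positions = np.array([0, 1, 3])          # their first positions (position 2 is to be predicted)
emb = build_input_embeddings(token_ids, positions, tok_table, pos_table)
print(emb.shape)                          # (3, 16): M inputs -> M latents, no extra vectors
```

The intermediate latent sequence of the encoder therefore stays at M vectors, which is where the claimed computation and memory savings come from.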
• FIG. 15 and FIG. 16 are schematic flowcharts of a data processing method provided by an embodiment of the present application. As shown in FIG. 15, the data processing method provided by the embodiment of the present application includes:
• each first embedding vector is used to represent one known data unit in the target data and the first position of that known data unit in the target data; the second embedding vector is used to represent the second position, in the target data, of the first data unit to be predicted;
  • the M is a positive integer;
  • the first encoder and the first prediction network are neural network models to be trained.
  • the target data is text data
  • the known data unit is a known word in the text data
• the first to-be-predicted data unit is a to-be-predicted word in the text data
• the target data is speech data
• the known data unit is a known audio sequence in the speech data
• the first to-be-predicted data unit is an audio sequence to be predicted in the speech data
  • the target data is image data
  • the known data unit is a known pixel in the image data
  • the first to-be-predicted data unit is a to-be-predicted pixel in the image data.
  • the first position is used to indicate a relative positional relationship between the known data unit and other known data units and between the known data unit and the first to-be-predicted data unit ;
  • the second position is used to indicate the relative positional relationship between the first data unit to be predicted and each known data unit in the target data.
• the first encoder is a first transformer layer
  • the first prediction network is a second transformer layer.
• For more description of step 1501, reference may be made to the description of step 601, which will not be repeated here.
• the first transformer layer includes multiple sub-transformer layers in series; each sub-transformer layer processes the data output by its adjacent previous sub-transformer layer to obtain M intermediate vectors, and outputs the M intermediate vectors to its adjacent next sub-transformer layer. If the sub-transformer layer is the layer closest to the input side among the multiple sub-transformer layers, its input data is the M first embedding vectors; if the sub-transformer layer is the layer closest to the output side among the multiple sub-transformer layers, the data it outputs is the M first output vectors.
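• The serial arrangement of sub-transformer layers can be sketched as follows (toy layers standing in for real transformer sub-layers; all shapes are hypothetical): the layer closest to the input side consumes the M first embedding vectors, each later layer consumes the M intermediate vectors of its predecessor, and the layer closest to the output side emits the M first output vectors.

```python
import numpy as np

def sub_layer(x, w):
    # stand-in for one sub-transformer layer: maps M vectors to M vectors
    h = x @ w
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-9)  # toy normalization

def encoder(x, weights):
    """Sub-transformer layers in series: the first layer consumes the M first
    embedding vectors, each later layer consumes its predecessor's M intermediate
    vectors, and the last layer emits the M first output vectors."""
    for w in weights:          # layers closest to the input side run first
        x = sub_layer(x, w)
    return x

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))                            # M = 4 embeddings, dim 8
weights = [rng.normal(size=(8, 8)) for _ in range(3)]  # 3 sub-layers in series
out = encoder(x, weights)
print(out.shape)  # (4, 8): the sequence length M is preserved through every sub-layer
```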
• For more description of step 1502, reference may be made to the description of step 602; the similarities will not be repeated here.
  • the third prediction data unit is the result of prediction performed by the first prediction network.
• For more description of step 1503, reference may be made to the description of step 603; the similarities will not be repeated here.
• the third predicted data unit is the result of the prediction performed by the first prediction network, so it is necessary to construct a loss based on the difference between the third predicted data unit and the first to-be-predicted data unit, and to update the first encoder and the first prediction network based on the constructed loss to obtain a target encoder and a target prediction network.
  • other network structures such as the embedding layer can also be updated based on the above loss, which is not limited here.
• the target data further includes a second to-be-predicted data unit; a prediction sequence may be obtained, where the prediction sequence is used to indicate the order in which the second to-be-predicted data unit and the first to-be-predicted data unit are predicted.
• If the second to-be-predicted data unit is predicted after the first to-be-predicted data unit, then after the third predicted data unit is obtained, a fourth embedding vector and a fifth embedding vector are obtained; the fourth embedding vector is used to represent the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data, and the fifth embedding vector is used to represent the third position of the second to-be-predicted data unit in the target data.
• The first encoder and the first prediction network are updated based on the difference between the third predicted data unit and the first to-be-predicted data unit and the difference between the fourth predicted data unit and the second to-be-predicted data unit, to obtain a target encoder and a target prediction network.
• the second output vector corresponding to each known data unit is generated according to the M first embedding vectors; the second output vector corresponding to the first data unit to be predicted is generated according to the M first embedding vectors and the fourth embedding vector.
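• The two-step prediction described above can be sketched as follows (the `encode`/`predict_unit` functions and all shapes are illustrative stand-ins, not the patented networks): the first to-be-predicted unit is obtained from the M first embedding vectors; its embedding plus position (the fourth embedding vector) is then appended to give M+1 inputs, from which the second unit is predicted with the fifth embedding vector.

```python
import numpy as np

def encode(vectors, w):
    # stand-in encoder: each output vector is generated from all input vectors
    return np.tanh(vectors @ w)

def predict_unit(output_vecs, query_vec, vocab):
    # stand-in prediction network: score candidate units against pooled state
    scores = vocab @ (output_vecs.mean(axis=0) + query_vec)
    return int(np.argmax(scores))

rng = np.random.default_rng(4)
d = 8
w = rng.normal(size=(d, d))
vocab = rng.normal(size=(10, d))          # embeddings of 10 candidate data units
first_embs = rng.normal(size=(3, d))      # M = 3 first embedding vectors
second_emb = rng.normal(size=d)           # position query for the first unit
fifth_emb = rng.normal(size=d)            # position query for the second unit

unit1 = predict_unit(encode(first_embs, w), second_emb, vocab)  # step 1
fourth_emb = vocab[unit1] + second_emb    # predicted unit + its position
stacked = np.vstack([first_embs, fourth_emb])                   # M + 1 inputs
unit2 = predict_unit(encode(stacked, w), fifth_emb, vocab)      # step 2
print(unit1 in range(10), unit2 in range(10))  # True True
```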
  • the parameter tuning in the training phase can be performed using the standard back propagation algorithm in deep learning.
• the loss function at this stage can be written, per the symbol definitions below, as L(θ₁) = −Σ_{i∈S} log P(y_i | x; θ₁), where:
• θ₁ is all the parameters of the model (including transformer parameters, position vector parameters and classifier parameters)
  • x is the entire input sequence, including several elements
• y represents all the words that need to be predicted (that is, the original word corresponding to each position to be predicted).
  • S represents the set of positions of all words in y
• y_i represents the word that needs to be predicted at the i-th position.
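• Under these definitions the pretraining loss is a sum of per-position negative log-likelihoods over the set S of positions to predict, which can be sketched as (toy logits and a 3-word vocabulary, not real model outputs):

```python
import numpy as np

def pretrain_loss(logits, targets):
    """Negative log-likelihood summed over the set S of positions to predict:
    L = -sum_{i in S} log P(y_i | x)."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)          # softmax per position
    return -np.log(probs[np.arange(len(targets)), targets]).sum()

logits = np.array([[2.0, 0.5, 0.1],   # classifier scores at each predicted position
                   [0.2, 1.5, 0.3]])  # |S| = 2 positions, vocabulary of 3 words
targets = np.array([0, 1])            # y_i: the original word at each position
loss = pretrain_loss(logits, targets)
print(loss > 0)  # True
```

The better the classifier scores match the original words y_i, the smaller the loss, which is what the back-propagation step minimizes.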
  • an embodiment of the present application further provides a data processing method, which includes:
  • the second embedding vector in this embodiment is used to indicate the target processing task, wherein the target processing task includes but is not limited to: short text classification, long text classification, natural language inference, Text similarity matching, sentiment classification, and more.
  • the first position is used to indicate a relative positional relationship between the data unit and other data units.
  • the target data is text data
  • the data unit is a word in the text data
  • the target data is voice data
• the data unit is an audio sequence in the voice data
  • the target data is image data
• the data units are pixels in the image data.
• For a more specific description of step 1701, reference may be made to the description of step 601 in the foregoing embodiment, which will not be repeated here.
• 1702. Process the M first embedding vectors through the target encoder to obtain M output vectors corresponding to the M data units, where the output vector corresponding to each data unit is generated according to the M first embedding vectors.
• the target encoder in the embodiment of the present application can be obtained by using the target encoder in the embodiment corresponding to FIG. 6a as a pre-trained model and fine-tuning it for the target processing task.
• the target encoder is the first transformer layer.
• the first transformer layer includes multiple sub-transformer layers in series; each sub-transformer layer processes the data output by its adjacent previous sub-transformer layer to obtain M intermediate vectors, and outputs the M intermediate vectors to its adjacent next sub-transformer layer. If the sub-transformer layer is the layer closest to the input side among the multiple sub-transformer layers, its input data is the M first embedding vectors; if the sub-transformer layer is the layer closest to the output side among the multiple sub-transformer layers, the data it outputs is the M output vectors.
• the target encoder includes an attention head, and attention information can be obtained, where the attention information is used to indicate the attention correlation between any two of the M first embedding vectors when the attention head processes the M first embedding vectors; the M first embedding vectors are then processed by the target encoder according to the attention information.
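• The role of the attention information can be sketched as a binary M×M matrix that zeroes out the attention weight between invisible pairs (a single toy head with identity projections; real heads use learned query/key/value projections):

```python
import numpy as np

def masked_attention(x, mask):
    """Single attention head (identity projections for brevity): scores between
    embedding pairs are kept only where the M x M attention matrix is 1."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask == 1, scores, -1e9)   # invisible pairs get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(2)
x = rng.normal(size=(3, 4))       # M = 3 first embedding vectors
mask = np.array([[1, 1, 0],       # row i marks which vectors j are visible to i
                 [1, 1, 0],
                 [1, 1, 1]])
out = masked_attention(x, mask)
print(out.shape)  # (3, 4)
```

With this mask, the first two output vectors are unaffected by the third input vector, exactly the visibility semantics the attention matrix encodes.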
• For a more specific description of step 1702, reference may be made to the description of step 602 in the foregoing embodiment; details are not repeated here.
  • the prediction network is the second transformer layer.
• the data processing method provided by the embodiment of the present application will be described by taking multi-task text classification and reading comprehension as examples of the target processing task.
  • FIG. 18 is an example of an embodiment about text sentiment classification.
  • the target data is "the cat sat on the mat”
  • the attention matrix is obtained according to the sentence segmentation.
• If the element in the i-th row and j-th column of the matrix is 1 (white), it means that in the subsequent modeling process the j-th word in the rearranged sequence is visible to the i-th word; otherwise it is invisible.
  • This module outputs the rearranged word vector sequence and the attention matrix (obtained according to the block) to the autoregressive word vector encoding module, and outputs the auxiliary information (task type, such as sentiment classification) of the label to be predicted to the query module.
• The autoregressive module can use a transformer layer as the autoregressive word vector encoder. This module adds each word vector in the rearranged word vector sequence to its corresponding position vector (each position corresponds to a position vector, which is part of the parameters of the model), and uses the attention matrix provided by the preprocessing module in the modeling process; the matrix defines whether each word is visible to other words while the transformer models the word representations (the solid lines in FIG. 18 indicate visibility). The transformer finally obtains, for each word, a word vector representation incorporating contextual information, and outputs it to the prediction module. The query module outputs the task vector corresponding to the task type to the prediction module. The prediction module still uses the transformer model, which models the vector representation of the sentence; each finally modeled word vector passes through a classifier to predict the corresponding label.
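• A heavily simplified sketch of the prediction step for sentiment classification (the mean-pooling stand-in replaces the transformer prediction module, and all names and shapes are hypothetical): the encoded word vectors are combined with the task vector from the query module and passed through a label classifier.

```python
import numpy as np

def classify(word_vecs, task_vec, w_cls):
    """Prediction-module sketch: pool the encoded word vectors with the task
    vector from the query module, then apply a label classifier."""
    pooled = word_vecs.mean(axis=0) + task_vec  # toy stand-in for the transformer
    logits = pooled @ w_cls                     # one score per sentiment label
    return int(np.argmax(logits))

rng = np.random.default_rng(3)
word_vecs = rng.normal(size=(6, 8))   # "the cat sat on the mat", dim 8
task_vec = rng.normal(size=8)         # task vector for sentiment classification
w_cls = rng.normal(size=(8, 2))       # 2 labels: negative / positive
label = classify(word_vecs, task_vec, w_cls)
print(label in (0, 1))  # True
```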
  • the parameter tuning in the fine-tuning stage can be performed by using the standard back propagation algorithm in deep learning.
• the loss function at this stage can be written as L(θ₂) = −log P(y | x; θ₂), where:
• θ₂ is all the parameters of the model (including Transformer parameters, position vector parameters, task encoding parameters and classifier parameters)
  • x is the entire input sequence, including several elements
  • y represents the label corresponding to the sentence.
  • FIG. 19 is an example of an embodiment of reading comprehension for span extraction.
• the task is to find the segment (span) of the answer in the text, that is, the positions of the start (START) and end (END) words in the text (i.e., "the" and "cat").
• the attention matrix is obtained according to the blocking of the sentence. If the element in the i-th row and j-th column of the matrix is 1 (white), it means that in the subsequent modeling process the j-th word in the rearranged sequence is visible to the i-th word; otherwise it is invisible.
• This module outputs the rearranged word vector sequence and the attention matrix (obtained according to the blocking) to the autoregressive word vector encoding module, and outputs the auxiliary information of the labels to be predicted (the position information of each word in the chapter) to the query module.
  • the autoregressive module can use the transformer as the autoregressive word vector encoder.
• This module adds each word vector in the rearranged word vector sequence to its corresponding position vector (each position corresponds to a position vector, which is part of the parameters of the model), and uses the attention matrix provided by the preprocessing module in the modeling process; the matrix defines whether each word is visible to other words while the Transformer models the word representations.
• The solid lines in the figure indicate visibility. The Transformer finally obtains, for each word, a word vector representation incorporating contextual information, and outputs it to the prediction module.
  • the query module outputs the task vector corresponding to the task type and outputs it to the prediction module.
  • the prediction module still uses the Transformer model, which models the vector representation of the sentence.
• Each finally modeled word vector passes through two classifiers (which respectively output the probability of each word being START and END, as shown in the table in FIG. 19) to predict the positions of the corresponding START and END.
  • the model predicts the probabilities of START and END for each word in the passage.
  • the parameter tuning in the fine-tuning stage is performed using the standard back-propagation algorithm in deep learning.
• the loss function at this stage can be written as L(θ₃) = −log P(y_START | x; θ₃) − log P(y_END | x; θ₃), where θ₃ is all the parameters of the model (including Transformer parameters, position vector parameters, task encoding parameters and classifier parameters), x is the entire input sequence, including several elements, and P(y_START | x; θ₃) and P(y_END | x; θ₃) are the predicted probabilities of the start and end positions of the answer span.
• The fine-tuned model can be used for prediction on the downstream task. The prediction method of the model is the same as that in the fine-tuning stage: after the four modules and the classifiers, the labels of sentences or words are obtained.
  • the model will take the word with the highest probability predicted by the START classifier as the word at the start position of the span, and then take the word with the highest END probability after the start position as the word at the end position of the span.
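• The span-selection rule just described can be sketched directly (toy probabilities for the passage "the cat ..."; the rule takes END at or after the chosen START position):

```python
import numpy as np

def extract_span(p_start, p_end):
    """Span selection rule: take the word with the highest START probability,
    then the highest END probability at or after the start position."""
    start = int(np.argmax(p_start))
    end = start + int(np.argmax(p_end[start:]))
    return start, end

p_start = np.array([0.7, 0.1, 0.1, 0.1])  # classifier: "the" most likely START
p_end   = np.array([0.2, 0.6, 0.1, 0.1])  # "cat" most likely END
print(extract_span(p_start, p_end))       # (0, 1) -> span "the cat"
```

Restricting the END search to positions at or after START guarantees a well-formed span even when the raw END distribution peaks before the chosen start word.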
  • FIG. 20 is a schematic structural diagram of a data processing apparatus 2000 provided by an embodiment of the application.
  • the data processing apparatus 2000 may be a terminal device or a server, and the data processing apparatus 2000 includes:
• An obtaining module 2001, configured to obtain M first embedding vectors and a second embedding vector; where each first embedding vector is used to represent one known data unit in the target data and the first position of that known data unit in the target data, and the second embedding vector is used to represent the second position, in the target data, of the first data unit to be predicted; M is a positive integer;
• For the specific description of the obtaining module 2001, reference may be made to the description of step 601 in the foregoing embodiment; details are not repeated here.
• An encoding module 2002, configured to process the M first embedding vectors through the target encoder to obtain M first output vectors corresponding to the M known data units, where the first output vector corresponding to each known data unit is generated according to the M first embedding vectors;
  • the prediction module 2003 is configured to process the M first output vectors and the second embedded vector through a target prediction network to obtain the first to-be-predicted data unit.
  • the first position is used to indicate a relative positional relationship between the known data unit and other known data units and between the known data unit and the first to-be-predicted data unit ;
  • the second position is used to indicate the relative positional relationship between the first data unit to be predicted and each known data unit in the target data.
• the target encoder is a first transformer layer
  • the target prediction network is a second transformer layer.
• the first transformer layer includes multiple sub-transformer layers in series; processing the M first embedding vectors through the target encoder to obtain the M first output vectors corresponding to the M known data units includes:
• processing, through each sub-transformer layer, the data output by its adjacent previous sub-transformer layer to obtain M intermediate vectors, and outputting the M intermediate vectors to its adjacent next sub-transformer layer; where, if the sub-transformer layer is the layer closest to the input side among the multiple sub-transformer layers, its input data is the M first embedding vectors, and if it is the layer closest to the output side, the data it outputs is the M first output vectors.
  • the target encoder includes an attention head
• the encoding module is configured to obtain attention information, where the attention information is used to indicate the attention correlation between any two of the M first embedding vectors when the attention head processes the M first embedding vectors; the M first embedding vectors are processed by the target encoder according to the attention information.
  • the apparatus further includes:
  • an embedding module configured to perform embedding processing on the M known data units in the target data through the embedding layer to obtain M third embedding vectors;
  • the target data further includes a second to-be-predicted data unit, and the prediction sequence of the second to-be-predicted data unit and the first to-be-predicted data unit is determined randomly.
  • the method further includes:
• the fourth embedding vector is used to represent the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data
• the fifth embedding vector is used to represent the third position of the second to-be-predicted data unit in the target data
• the M first embedding vectors and the fourth embedding vector are processed to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit;
  • the M+1 second output vectors and the fifth embedded vector are processed to obtain the second to-be-predicted data unit.
• the second output vector corresponding to each known data unit is generated according to the M first embedding vectors; the second output vector corresponding to the first data unit to be predicted is generated according to the M first embedding vectors and the fourth embedding vector.
  • the target data is text data
  • the known data unit is a known word in the text data
• the first to-be-predicted data unit is a to-be-predicted word in the text data
  • the target data is voice data
  • the known data unit is a known audio sequence in the voice data
  • the first to-be-predicted data unit is a to-be-predicted audio sequence in the voice data
  • the target data is image data
  • the known data unit is a known pixel in the image data
  • the first to-be-predicted data unit is a to-be-predicted pixel in the image data.
  • FIG. 21 is a schematic structural diagram of a data processing apparatus 2100 provided by an embodiment of the present application.
  • the data processing apparatus 2100 may be a terminal device or a server, and the data processing apparatus 2100 includes:
• An obtaining module 2101, configured to obtain M first embedding vectors and a second embedding vector; where each first embedding vector is used to represent one data unit in the target data and the first position of that data unit in the target data, and the second embedding vector is used to indicate the target processing task; M is a positive integer;
• An encoding module 2102, configured to process the M first embedding vectors through the target encoder to obtain M output vectors corresponding to the M data units, where the output vector corresponding to each data unit is generated according to the M first embedding vectors;
• For the specific description of the encoding module 2102, reference may be made to the description of step 1702 in the foregoing embodiment; details are not repeated here.
  • the task processing module 2103 is configured to perform processing corresponding to the target processing task on the M output vectors and the second embedding vector through a task network to obtain a task processing result.
  • the first position is used to indicate a relative positional relationship between the data unit and other data units.
  • the target encoder is a first transformer layer
  • the task network is a second transformer layer
• the first transformer layer includes multiple sub-transformer layers in series; processing the M first embedding vectors through the target encoder to obtain the M output vectors corresponding to the M data units includes:
• processing, through each sub-transformer layer, the data output by its adjacent previous sub-transformer layer to obtain M intermediate vectors, and outputting the M intermediate vectors to its adjacent next sub-transformer layer; where, if the sub-transformer layer is the layer closest to the input side among the multiple sub-transformer layers, its input data is the M first embedding vectors, and if it is the layer closest to the output side, the data it outputs is the M output vectors.
  • the target encoder includes an attention head
• the encoding module is configured to obtain attention information, where the attention information is used to indicate the attention correlation between any two of the M first embedding vectors when the attention head processes the M first embedding vectors; the M first embedding vectors are processed by the target encoder according to the attention information.
  • the target data is text data
  • the data unit is a word in the text data
  • the target data is voice data
• the data unit is an audio sequence in the voice data
  • the target data is image data
• the data units are pixels in the image data.
  • the target processing task includes short text classification, long text classification, natural language inference, text similarity matching or text sentiment classification.
  • FIG. 22 is a schematic structural diagram of a data processing apparatus 2200 provided by an embodiment of the application.
  • the data processing apparatus 2200 may be a terminal device or a server, and the data processing apparatus 2200 includes:
• An obtaining module 2201, configured to obtain a first encoder, a first prediction network, M first embedding vectors, and a second embedding vector; where each first embedding vector is used to represent one known data unit in the target data and the first position of that known data unit in the target data, and the second embedding vector is used to represent the second position, in the target data, of the first to-be-predicted data unit;
  • the M is a positive integer;
• An encoding module 2202, configured to process the M first embedding vectors through the first encoder to obtain M first output vectors corresponding to the M known data units, where the first output vector corresponding to each known data unit is generated according to the M first embedding vectors;
• For the specific description of the encoding module 2202, reference may be made to the description of step 1502 in the foregoing embodiment; details are not repeated here.
  • a prediction module 2203 configured to process the M first output vectors and the second embedded vectors through the first prediction network to obtain a third prediction data unit;
• For the specific description of the prediction module 2203, reference may be made to the description of step 1503 in the foregoing embodiment; details are not repeated here.
  • a model training module 2204 configured to update the first encoder and the first prediction network based on the difference between the third predicted data unit and the first to-be-predicted data unit to obtain the target encoder and target prediction network.
• For the specific description of the model training module 2204, reference may be made to the description of step 1504 in the foregoing embodiment; details are not repeated here.
  • the first position is used to indicate a relative positional relationship between the known data unit and other known data units and between the known data unit and the first to-be-predicted data unit ;
  • the second position is used to indicate the relative positional relationship between the first data unit to be predicted and each known data unit in the target data.
• the first encoder is a first transformer layer
  • the first prediction network is a second transformer layer.
• the first transformer layer includes a plurality of sub-transformer layers in series; processing the M first embedding vectors through the first encoder to obtain the M first output vectors corresponding to the M known data units includes:
• processing, through each sub-transformer layer, the data output by its adjacent previous sub-transformer layer to obtain M intermediate vectors, and outputting the M intermediate vectors to its adjacent next sub-transformer layer; where, if the sub-transformer layer is the layer closest to the input side among the multiple sub-transformer layers, its input data is the M first embedding vectors, and if it is the layer closest to the output side, the data it outputs is the M first output vectors.
  • the target data further includes a second to-be-predicted data unit, and the prediction sequence of the second to-be-predicted data unit and the first to-be-predicted data unit is determined randomly.
  • the method further includes:
• the fourth embedding vector is used to represent the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data
• the fifth embedding vector is used to represent the third position of the second to-be-predicted data unit in the target data
• the M first embedding vectors and the fourth embedding vector are processed to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit;
• the M+1 second output vectors and the fifth embedding vector are processed to obtain the fourth predicted data unit;
  • the first encoder and the first prediction network are updated based on the difference between the third prediction data unit and the first to-be-predicted data unit to obtain a target encoder and a target prediction network, including :
• the second output vector corresponding to each known data unit is generated according to the M first embedding vectors; the second output vector corresponding to the first data unit to be predicted is generated according to the M first embedding vectors and the fourth embedding vector.
  • the target data is text data
  • the known data unit is a known word in the text data
• the first to-be-predicted data unit is a to-be-predicted word in the text data
  • the target data is voice data
  • the known data unit is a known audio sequence in the voice data
  • the first to-be-predicted data unit is a to-be-predicted audio sequence in the voice data
  • the target data is image data
  • the known data unit is a known pixel in the image data
  • the first to-be-predicted data unit is a to-be-predicted pixel in the image data.
  • FIG. 23 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
• the execution device 2300 may specifically be a virtual reality (VR) device, a mobile phone, a tablet, a laptop, a smart wearable device, a monitoring data processing device, a server, or the like, which is not limited here.
  • the execution device 2300 includes: a receiver 2301, a transmitter 2302, a processor 2303, and a memory 2304 (wherein the number of processors 2303 in the execution device 2300 may be one or more, and one processor is taken as an example in FIG. 23) , wherein the processor 2303 may include an application processor 23031 and a communication processor 23032.
  • the receiver 2301, the transmitter 2302, the processor 2303, and the memory 2304 may be connected by a bus or otherwise.
  • Memory 2304 may include read-only memory and random access memory, and provides instructions and data to processor 2303 .
  • a portion of memory 2304 may also include non-volatile random access memory (NVRAM).
• the memory 2304 stores operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
  • the processor 2303 controls the operation of the execution device.
  • various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 2303 or implemented by the processor 2303 .
  • the processor 2303 may be an integrated circuit chip, which has signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 2303 or an instruction in the form of software.
• the above-mentioned processor 2303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the processor 2303 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 2304, and the processor 2303 reads the information in the memory 2304, and completes the steps of the above method in combination with its hardware.
  • the receiver 2301 can be used to receive input numerical or character information, and generate signal input related to the relevant setting and function control of the execution device.
  • The transmitter 2302 can be used to output numerical or character information through the first interface; the transmitter 2302 can also be used to send instructions to a disk group through the first interface to modify the data in the disk group; the transmitter 2302 may also include a display device such as a display screen.
  • the processor 2303 is configured to execute the data processing methods described in the corresponding embodiments of FIG. 6a and FIG. 17 .
  • FIG. 24 is a schematic structural diagram of the training device provided by the embodiment of the present application.
  • the training device 2400 is implemented by one or more servers.
  • The training device 2400 can vary widely in configuration or performance, and can include one or more central processing units (CPUs) 2424 (e.g., one or more processors), memory 2432, and one or more storage media 2430 (e.g., one or more mass storage devices) storing application programs 2442 or data 2444.
  • the memory 2432 and the storage medium 2430 may be short-term storage or persistent storage.
  • the program stored in the storage medium 2430 may include one or more modules (not shown in the figure), and each module may include a series of instructions to operate on the training device. Further, the central processing unit 2424 may be configured to communicate with the storage medium 2430 to execute a series of instruction operations in the storage medium 2430 on the training device 2400.
  • The training device 2400 may also include one or more power supplies 2426, one or more wired or wireless network interfaces 2450, one or more input/output interfaces 2458, and one or more operating systems 2441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • the central processing unit 2424 is configured to execute the data processing method described in the embodiment corresponding to FIG. 15 .
  • Embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • Embodiments of the present application further provide a computer-readable storage medium that stores a program for performing signal processing; when the program runs on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • the execution device, training device, or terminal device provided in this embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins or circuits, etc.
  • the processing unit can execute the computer executable instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiments, or the chip in the training device executes the data processing method described in the above embodiment.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • The storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
  • FIG. 25 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip can be expressed as a neural network processor NPU 2500, and the NPU 2500 is mounted on the main CPU (Host CPU) as a co-processor, and the Host CPU assigns tasks.
  • the core part of the NPU is the operation circuit 2503, and the controller 2504 controls the operation circuit 2503 to extract the data in the memory (weight memory or input memory) and perform operations.
  • the arithmetic circuit 2503 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 2503 is a two-dimensional systolic array. The arithmetic circuit 2503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 2503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2502 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 2501 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 2508 .
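As an illustrative sketch only (plain Python standing in for the hardware, not the patent's actual circuit), the pattern described for operation circuit 2503 and accumulator 2508 — computing a matrix product while collecting partial results in an accumulator — can be written as:

```python
def matmul_accumulate(A, B):
    """Multiply matrices A (m x k) and B (k x n) by accumulating rank-1
    partial products, mirroring how partial or final results of the
    matrix operation are collected in an accumulator."""
    m, k, n = len(A), len(B), len(B[0])
    # The "accumulator": holds partial results until the final result is ready.
    C = [[0.0] * n for _ in range(m)]
    for t in range(k):                         # one step per buffered row of B
        for i in range(m):
            for j in range(n):
                C[i][j] += A[i][t] * B[t][j]   # accumulate one partial product
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_accumulate(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

A real systolic array performs these multiply-accumulate steps in parallel across its PEs; the sequential loop above only shows the arithmetic being accumulated.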
  • Unified memory 2506 is used to store input data and output data.
  • The weight data is transferred directly to the weight memory 2502 through the direct memory access controller (DMAC) 2505.
  • Input data is also moved to unified memory 2506 via the DMAC.
  • The BIU is the bus interface unit 2510, which is used for the interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 2509.
  • the bus interface unit 2510 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 2509 to obtain instructions from the external memory, and also for the storage unit access controller 2505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 2506 , the weight data to the weight memory 2502 , or the input data to the input memory 2501 .
  • the vector calculation unit 2507 includes a plurality of operation processing units, and further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc., if necessary. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 2507 can store the processed output vectors to the unified memory 2506 .
  • The vector calculation unit 2507 may apply a linear or non-linear function to the output of the operation circuit 2503, for example performing linear interpolation on the feature plane extracted by a convolutional layer, or applying a non-linear function to a vector of accumulated values to generate activation values.
  • the vector computation unit 2507 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 2503, such as for use in subsequent layers in a neural network.
  • the instruction fetch memory (instruction fetch buffer) 2509 connected to the controller 2504 is used to store the instructions used by the controller 2504;
  • The unified memory 2506, the input memory 2501, the weight memory 2502, and the instruction fetch memory 2509 are all on-chip memories. The external memory is memory external to the NPU hardware architecture.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
  • The device embodiments described above are only schematic; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, which can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • The computer-readable storage medium may be any available medium that a computer can store, or a data storage device, such as a training device or a data center, that integrates one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), and the like.

Abstract

A data processing method, comprising: obtaining first embedding vectors that represent known data units and the positions of the known data units, and a second embedding vector that represents the position of a data unit to be predicted; processing the first embedding vectors by a target encoder to obtain output vectors; and processing the output vectors and the second embedding vector by a target prediction network to obtain the data unit to be predicted. The method requires no additional M pieces of position information as separate inputs to the target encoder, and the number of hidden variables in the target encoder's intermediate outputs remains equal to the number of input embedding vectors, which reduces the target encoder's computation and memory consumption.

Description

Data processing method and related device
This application claims priority to Chinese Patent Application No. 202110415349.1, filed with the China National Intellectual Property Administration on April 18, 2021 and entitled "Data processing method and related device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a data processing method and related devices.
Background
Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers or digital-computer-controlled machines to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that seeks to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
A language model is a model that can predict an unknown word in a sentence from a given semantic fragment. For example, given the natural-language fragment "Huawei __ is very good.", a language model can generate the unknown word from the fragment; in this example the model may generate the word "phones", yielding the sentence "Huawei phones are very good.".
An existing natural-language generation model (see FIG. 6b) fuses an autoencoding model with an autoregressive language model. Compared with either the autoencoding model or the autoregressive language model alone, the fused model doubles the number of hidden states: the white part corresponds to the autoencoding model and the gray part to the autoregressive language model. The hidden variables of the autoencoding model represent position information, while the autoregressive model provides the context information for the autoencoding language model's predictions. The computation and memory consumption of this model are therefore twice those of the autoencoding and autoregressive models, so a language model with less computation and memory consumption is needed.
发明内容
第一方面,本申请提供了一种数据处理方法,其特征在于,所述方法包括:
获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;
其中，目标数据为存在数据缺失的数据，其中，目标数据包括未缺失的数据(本申请实施例中称之为已知数据单元)以及缺失的数据(本申请实施例中称之为待预测数据单元，例如第一待预测数据单元以及第二待预测数据单元)。其中，已知数据单元为未缺失的数据中的数据单元，例如所述目标数据可以为文本数据，则目标数据中的已知数据单元可以为文本数据中已知的单词或者是已知的字，待预测数据单元可以为所述文本数据中的待预测词或者是待预测字。例如所述目标数据可以为语音数据，则目标数据中的已知数据单元可以为语音数据中的已知音频序列，待预测数据单元可以为所述语音数据中的待预测音频序列。例如所述目标数据可以为图像数据，则目标数据中的已知数据单元可以为图像数据中的已知像素点，待预测数据单元可以为所述图像数据中的待预测像素点。应理解，已知数据单元和待预测数据单元的数据粒度和目标数据的类型有关，已知数据单元和待预测数据单元的数据粒度可以是目标数据中最小的数据单元或者是由最小的数据单元组成的多个数据单元，这里并不限定已知数据单元和待预测数据单元的粒度。
通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;
其中,所谓每个第一输出向量为基于M个第一嵌入向量得到的,可以理解为每个第一输出向量可以将M个第一嵌入向量作为参考,也就是在生成每个第一输出向量时,各个第一嵌入向量都是可见的,或者,每个第一输出向量与M个第一嵌入向量存在依赖关系;
在一种实现中,所述目标编码器可以为转换transformer层,则每个第一输出向量为基于M个第一嵌入向量得到可以理解为M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联。
通过目标预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到所述第一待预测数据单元;
本申请实施例,针对于M个已知数据单元对应的M个第一嵌入向量,目标编码器可以将M个第一嵌入向量作为输入,其中第一嵌入向量包括了各个已知数据单元的位置信息和已知数据单元的数据信息,而不需要再单独设置额外的M个位置信息来作为目标编码器的输入,此外,目标编码器的中间输出的隐变量的数量也和输入的嵌入向量的数量保持一致,减少了目标编码器的计算量和内存消耗。
在一种可能的实现中,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系;所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
在一种可能的实现中,所述目标编码器为第一转换transformer层,所述目标预测网络为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的 transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
也就是说,每个子transformer层的输入包括与M个已知数据单元对应的M个特征向量,每个子transformer层的输出包括与M个已知数据单元对应的M个输出向量。进而使得目标编码器的中间输出的隐变量的数量也和输入的嵌入向量的数量保持一致,减少了目标编码器的计算量和内存消耗。
在一种可能的实现中,所述目标编码器包括注意力头,所述通过目标编码器,对所述M个第一嵌入向量进行处理,包括:
获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;
根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
在一种可能的实现中,所述方法还包括:
通过嵌入层对所述目标数据中的M个已知数据单元进行嵌入处理,以得到M个第三嵌入向量;其中嵌入层可以称为输入嵌入(input embedding)层。当前输入可以为M个已知数据单元。嵌入层在获取当前输入后,可以对该当前输入中各个已知数据单元进行嵌入处理,可得到各个已知数据单元对应的嵌入向量(也就是第三嵌入向量);
获取所述M个已知数据单元中的每个已知数据单元的位置向量,所述位置向量用于指示所述第一位置;在一些实施例中,还可以获取所述M个已知数据单元中的每个已知数据单元的位置向量,所述位置向量用于指示所述第一位置;其中,第一位置用于表示已知数据单元在目标数据中的位置,具体的,所述第一位置可以用于指示所述目标数据中已知数据单元与除自身之外的其他已知数据单元以及所述第一待预测数据单元之间的相对位置关系;
将所述M个第三嵌入向量中的每个第三嵌入向量与对应的位置向量进行融合,以得到所述M个第一嵌入向量;应理解,融合的方式可以是对第三嵌入向量和位置向量进行加法运算,或者是通过其他运算使得第一嵌入向量可以携带目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置的信息,这里并不限定具体的融合方式。
在一种可能的实现中,所述目标数据还包括第二待预测数据单元,且所述第二待预测数据单元与所述第一待预测数据单元的被预测先后顺序为随机确定的。
在一种可能的实现中,若所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则所述方法还包括:
获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述第二待预测数据单元在所述目标数据中的第三位置;
通过所述目标编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到M个已知数据单元以及所述第一待预测数据单元对应的M+1个第二输出向量;
通过所述目标预测网络,对所述M+1个第二输出向量以及所述第五嵌入向量进行处理,以得到所述第二待预测数据单元。
本申请实施例中采用随机顺序的方式来进行预测,充分利用了待预测数据单元的顺序信息,将顺序信息显式的融入输出向量中。
在一种可能的实现中,所述每个已知数据单元对应的第二输出向量为根据所述M个第一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
在一种可能的实现中,所述目标数据为文本数据,所述已知数据单元为所述文本数据中的已知词,所述第一待预测数据单元为所述文本数据中的待预测词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的已知音频序列,所述第一待预测数据单元为所述语音数据中的待预测音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的已知像素点,所述第一待预测数据单元为所述图像数据中的待预测像素点。
第二方面,本申请提供了一种数据处理方法,所述方法包括:
获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个数据单元以及所述一个数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于指示目标处理任务;所述M为正整数;
通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个数据单元对应的M个输出向量,其中,每个所述数据单元对应的输出向量为根据所述M个第一嵌入向量生成的;
通过任务网络,对所述M个输出向量以及所述第二嵌入向量进行所述目标处理任务对应的处理,以得到任务处理结果。
在一种可能的实现中,所述第一位置用于指示所述数据单元与其他数据单元之间的相对位置关系。
在一种可能的实现中,所述目标编码器为第一转换transformer层,所述任务网络为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个输出向量。
在一种可能的实现中,所述目标编码器包括注意力头,所述通过目标编码器,对所述M个第一嵌入向量进行处理,包括:
获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;
根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
在一种可能的实现中,所述目标数据为文本数据,所述数据单元为所述文本数据中的词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的像素点。
在一种可能的实现中,所述目标处理任务包括短文本分类、长文本分类、自然语言推断、文本相似度匹配或文本情感分类。
第三方面,本申请提供了一种数据处理方法,所述方法包括:
获取第一编码器、第一预测网络、M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;
通过所述第一编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;
通过所述第一预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到第三预测数据单元;
基于所述第三预测数据单元与所述第一待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。
在一种可能的实现中,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系;所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
在一种可能的实现中,所述第一编码器为第一转换transformer层,所述第一预测网络 为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过所述第一编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
在一种可能的实现中,所述目标数据还包括第二待预测数据单元,且所述第二待预测数据单元与所述第一待预测数据单元的被预测先后顺序为随机确定的。
在一种可能的实现中,若所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则所述方法还包括:
获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述目标数据中的第二待预测数据单元在所述目标数据中的第三位置;
通过所述第一编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到M个已知数据单元以及所述第一待预测数据单元对应的M+1个第二输出向量;
通过所述第一预测网络，对所述M+1个第二输出向量以及所述第五嵌入向量进行处理，以得到第四预测数据单元；
所述基于所述第三预测数据单元与所述第一待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络,包括:
基于所述第三预测数据单元与所述第一待预测数据单元之间的差异、以及所述第四预测数据单元与所述第二待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。
在一种可能的实现中,所述每个已知数据单元对应的第二输出向量为根据所述M个第一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
在一种可能的实现中,所述目标数据为文本数据,所述已知数据单元为所述文本数据中的已知词,所述第一待预测数据单元为所述文本数据中的待预测词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的已知音频序列,所 述第一待预测数据单元为所述语音数据中的待预测音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的已知像素点,所述第一待预测数据单元为所述图像数据中的待预测像素点。
第四方面,本申请提供了一种数据处理装置,包括:
获取模块,用于获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;
编码模块,用于通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;
预测模块,用于通过目标预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到所述第一待预测数据单元。
在一种可能的实现中,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系;所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
在一种可能的实现中,所述目标编码器为第一转换transformer层,所述目标预测网络为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
在一种可能的实现中,所述目标编码器包括注意力头,所述编码模块,用于获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;
根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
在一种可能的实现中,所述装置还包括:
嵌入模块,用于通过嵌入层对所述目标数据中的M个已知数据单元进行嵌入处理,以得到M个第三嵌入向量;
获取所述M个已知数据单元中的每个已知数据单元的位置向量,所述位置向量用于指示所述第一位置;
将所述M个第三嵌入向量中的每个第三嵌入向量与对应的位置向量进行融合,以得到所述M个第一嵌入向量。
在一种可能的实现中,所述目标数据还包括第二待预测数据单元,且所述第二待预测数据单元与所述第一待预测数据单元的被预测先后顺序为随机确定的。
在一种可能的实现中,若所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则所述方法还包括:
获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述第二待预测数据单元在所述目标数据中的第三位置;
通过所述目标编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到M个已知数据单元以及所述第一待预测数据单元对应的M+1个第二输出向量;
通过所述目标预测网络,对所述M+1个第二输出向量以及所述第五嵌入向量进行处理,以得到所述第二待预测数据单元。
在一种可能的实现中,所述每个已知数据单元对应的第二输出向量为根据所述M个第一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
在一种可能的实现中,所述目标数据为文本数据,所述已知数据单元为所述文本数据中的已知词,所述第一待预测数据单元为所述文本数据中的待预测词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的已知音频序列,所述第一待预测数据单元为所述语音数据中的待预测音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的已知像素点,所述第一待预测数据单元为所述图像数据中的待预测像素点。
第五方面,本申请提供了一种数据处理装置,包括:
获取模块,用于获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个数据单元以及所述一个数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于指示目标处理任务;所述M为正整数;
编码模块,用于通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个 数据单元对应的M个输出向量,其中,每个所述数据单元对应的输出向量为根据所述M个第一嵌入向量生成的;
任务处理模块,用于通过任务网络,对所述M个输出向量以及所述第二嵌入向量进行所述目标处理任务对应的处理,以得到任务处理结果。
在一种可能的实现中,所述第一位置用于指示所述数据单元与其他数据单元之间的相对位置关系。
在一种可能的实现中,所述目标编码器为第一转换transformer层,所述任务网络为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个输出向量。
在一种可能的实现中,所述目标编码器包括注意力头,所述编码模块,用于获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;
根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
在一种可能的实现中,所述目标数据为文本数据,所述数据单元为所述文本数据中的词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的像素点。
在一种可能的实现中,所述目标处理任务包括短文本分类、长文本分类、自然语言推断、文本相似度匹配或文本情感分类。
第六方面,本申请提供了一种数据处理装置,包括:
获取模块,用于获取第一编码器、第一预测网络、M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已 知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;
编码模块,用于通过所述第一编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;
预测模块,用于通过所述第一预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到第三预测数据单元;
模型训练模块,用于基于所述第三预测数据单元与所述第一待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。
在一种可能的实现中,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系;所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
在一种可能的实现中,所述第一编码器为第一转换transformer层,所述第一预测网络为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过所述第一编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
在一种可能的实现中,所述目标数据还包括第二待预测数据单元,且所述第二待预测数据单元与所述第一待预测数据单元的被预测先后顺序为随机确定的。
在一种可能的实现中,若所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则所述方法还包括:
获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述目标数据中的第二待预测数据单元在所述目标数据中的第三位置;
通过所述第一编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到M个已知数据单元以及所述第一待预测数据单元对应的M+1个第二输出向量;
通过所述第一预测网络，对所述M+1个第二输出向量以及所述第五嵌入向量进行处理，以得到第四预测数据单元；
所述基于所述第三预测数据单元与所述第一待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络,包括:
基于所述第三预测数据单元与所述第一待预测数据单元之间的差异、以及所述第四预测数据单元与所述第二待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。
在一种可能的实现中,所述每个已知数据单元对应的第二输出向量为根据所述M个第一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
在一种可能的实现中,所述目标数据为文本数据,所述已知数据单元为所述文本数据中的已知词,所述第一待预测数据单元为所述文本数据中的待预测词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的已知音频序列,所述第一待预测数据单元为所述语音数据中的待预测音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的已知像素点,所述第一待预测数据单元为所述图像数据中的待预测像素点。
第七方面，本申请实施例提供了一种执行设备，可以包括存储器、处理器以及总线系统，其中，存储器用于存储程序，处理器用于执行存储器中的程序，以执行如上述第一方面及其任一可选的方法、第二方面及其任一可选的方法。
第八方面，本申请实施例提供了一种训练设备，可以包括存储器、处理器以及总线系统，其中，存储器用于存储程序，处理器用于执行存储器中的程序，以执行如上述第三方面及其任一可选的方法。
第九方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面及其任一可选的方法、第二方面及其任一可选的方法、第三方面及其任一可选的方法。
第十方面,本申请实施例提供了一种计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面及其任一可选的方法、第二方面及其任一可选的方法、第三方面及其任一可选的方法。
第十一方面，本申请提供了一种芯片系统，该芯片系统包括处理器，用于支持执行设备或训练设备实现上述方面中所涉及的功能，例如，发送或处理上述方法中所涉及的数据；或，信息。在一种可能的设计中，所述芯片系统还包括存储器，所述存储器，用于保存执行设备或训练设备必要的程序指令和数据。该芯片系统，可以由芯片构成，也可以包括芯片和其他分立器件。
本申请实施例提供了一种数据处理方法,所述方法包括:获取M个第一嵌入向量、以 及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;通过目标预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到所述第一待预测数据单元。通过上述方式,针对于M个已知数据单元对应的M个第一嵌入向量,目标编码器可以将M个第一嵌入向量作为输入,其中第一嵌入向量包括了位置信息和已知数据单元的数据信息,而不需要再单独设置额外的M个位置信息来作为目标编码器的输入,此外,目标编码器的中间输出的隐变量的数量也和输入的嵌入向量的数量保持一致,减少了目标编码器的计算量和内存消耗。
附图说明
图1为人工智能主体框架的一种结构示意图;
图2为一种自然语言处理系统；
图3a为另一种自然语言处理系统；
图3b为一种系统的结构示意图；
图4为本申请实施例提供的自然语言处理的相关设备的示意图;
图5为一种transformer层的架构示意;
图6a为本申请实施例提供的一种数据处理方法的实施例示意;
图6b为一种数据处理方法的实施例示意;
图6c为本申请实施例提供的一种数据处理方法的实施例示意;
图7为本申请实施例中的一种神经网络模型的结构示意;
图8为一种transformer层的结构示意;
图9为一个注意力头head的操作示意图;
图10至图19为本申请实施例提供的一种数据处理方法的实施例示意图;
图20为本申请实施例提供的数据处理装置的一种结构示意图;
图21为本申请实施例提供的数据处理装置的一种结构示意图;
图22为本申请实施例提供的数据处理装置的一种结构示意图;
图23为本申请实施例提供的执行设备的一种结构示意图;
图24为本申请实施例提供的训练设备一种结构示意图;
图25为本申请实施例提供的芯片的一种结构示意图。
具体实施方式
下面结合本发明实施例中的附图对本发明实施例进行描述。本发明的实施方式部分使用的术语仅用于对本发明的具体实施例进行解释,而非旨在限定本发明。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的 发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、***、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
首先对人工智能系统总体工作流程进行描述，请参见图1，图1示出的为人工智能主体框架的一种结构示意图，下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。其中，“智能信息链”反映从数据的获取到处理的一系列过程。举例来说，可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中，数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持，实现与外部世界的沟通，并通过基础平台实现支撑。通过传感器与外部沟通；计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供；基础平台包括分布式计算框架及网络等相关的平台保障和支持，可以包括云存储和计算、互联互通网络等。举例来说，传感器和外部沟通获取数据，这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本，还涉及到传统设备的物联网数据，包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中，模拟人类的智能推理方式，依据推理控制策略，利用形式化的信息进行机器思维和求解问题的过程，典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后，进一步基于数据处理的结果可以形成一些通用的能力，比如可以是算法或者一个通用系统，例如，翻译，文本的分析，计算机视觉的处理，语音识别，图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用，是对人工智能整体解决方案的封装，将智能信息决策产品化、实现落地应用，其应用领域主要包括：智能终端、智能交通、智能医疗、自动驾驶、智慧城市等。
本申请可以应用于人工智能领域的自然语言处理领域、图像处理领域以及音视频处理领域中,下面以自然语言处理为例将对多个落地到产品的多个应用场景进行介绍。
为了更好地理解本申请实施例的方案,下面先结合图1至图3a对本申请实施例可能的应用场景进行简单的介绍。
图2示出了一种自然语言处理系统，该自然语言处理系统包括用户设备以及数据处理设备。其中，用户设备包括手机、个人电脑或者信息处理中心等智能终端。用户设备为自然语言数据处理的发起端，作为语言问答或者查询等请求的发起方，通常用户通过用户设备发起请求。
上述数据处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有数据处理功能的设备或服务器。数据处理设备通过交互接口接收来自智能终端的查询语句/语音/文本等,再通过存储数据的存储器以及数据处理的处理器环节进行机器学习,深度学习,搜索,推理,决策等方式的语言数据处理,并将处理结果反馈至用户设备。数据处理设备中的存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,数据库可以在数据处理设备上,也可以在其它网络服务器上。
在图2所示的自然语言处理***中,用户设备可以接收用户的指令,例如用户设备可以接收用户输入的一段文本,然后向数据处理设备发起请求,使得数据处理设备针对用户设备得到的该一段文本执行自然语言处理应用(例如自然语言生成、文本分类、文本推理、命名实体识别、翻译等),从而得到针对该一段文本的对应的自然语言处理应用的处理结果(例如预测词结果、分类结果、推理结果、命名实体识别结果、翻译结果等)。
以自然语言生成为例,自然语言生成(natural language generation)也可以称之为文本预测任务或者自然语言合成任务,是指在给定一段文字的前提下,生成其中的缺失文本或者后续文本的任务。自然语言生成在搜索引擎,输入法等场景均有广泛应用,可以在用户输入部分文字的前提下预测用户接下来的输入,可以大大提高用户的使用该产品的效率,此外还可以对存在文字缺失的文本进行恢复。示例性的,本申请实施例中,用户设备可以接收用户输入的一段文本数据(例如本申请实施例中描述的目标数据),其中文本数据中包括已知词和待预测词,待预测词不可见,仅仅知晓待预测词在文本数据中的位置,然后用户设备可以向数据处理设备发起请求(请求中携带文本数据),使得数据处理设备对该文本数据中的待预测词进行预测,从而得到待预测词,并将待预测词反馈至用户设备。
示例性的,用户设备可以接收用户输入的一段文本数据,然后向数据处理设备发起请求,使得数据处理设备对该一段文本数据进行实体分类,从而得到针对该一段文本数据的实体分类结果,并将实体分类结果反馈至用户设备;
示例性的,用户设备可以接收用户输入的一段文本数据(文本数据为中文文本),然后向数据处理设备发起请求,使得数据处理设备将该一段文本数据翻译成英文,从而得到针对该一段文本数据的英文译文,并将英文译文反馈至用户设备。
在图2中,数据处理设备可以通过本申请实施例的数据处理方法来处理上述文本数据。
图3a示出了另一种自然语言处理系统，在图3a中，用户设备直接作为数据处理设备，该用户设备能够直接接收来自用户的输入并直接由用户设备本身的硬件进行处理，具体过程与图2相似，可参考上面的描述，在此不再赘述。
图4是本申请实施例提供的自然语言处理的相关设备300的示意图。
上述图2和图3a中的用户设备具体可以是图4中的本地设备301或者本地设备302，图2中的数据处理设备具体可以是图4中的执行设备310，其中，数据存储系统350可以存储执行设备310的待处理数据，数据存储系统350可以集成在执行设备310上，也可以设置在云上或其它网络服务器上。
图2和图3a中的处理器可以通过神经网络模型或者其它模型进行数据训练/机器学习/深度学习,并利用数据最终训练或者学习得到的模型(例如本申请实施例中的目标编码器、目标预测网络、任务网络等等)针对文本数据(例如本申请实施例中描述的目标数据)执行自然语言处理应用(例如自然语言生成、文本分类、序列标注、阅读理解、文本生成、文本推理、翻译等),从而得到相应的处理结果(例如本申请实施例中的第一待预测数据单元、第二待预测数据单元以及任务处理结果等等)。
应理解,本申请实施例还可以应用在图像处理领域以及音视频处理领域中,则上述数据处理设备通过本申请实施例的数据处理方法来处理目标数据。
应理解,上述数据处理设备在后续实施例中也可以称之为数据处理装置、执行设备、服务器、终端设备等。
下面结合图3b对本申请实施例提供的系统架构进行详细的介绍。图3b为本申请一实施例提供的系统架构示意图。如图3b所示，系统架构500包括执行设备510、训练设备520、数据库530、客户设备540、数据存储系统550以及数据采集系统560。
执行设备510包括计算模块511、I/O接口512、预处理模块513和预处理模块514。计算模块511中可以包括目标模型/规则501,预处理模块513和预处理模块514是可选的。
数据采集设备560用于采集训练数据。其中,在自然语言合成的任务中,训练数据可以为存在文本缺失的文本数据以及该存在文本缺失的文本数据对应的完整文本数据;在音频合成的任务中,训练数据可以为存在音频序列缺失的语音数据以及该存在音频序列缺失的语音数据对应的完整语音数据;在图像合成(或者称之为图像重建)的任务中,训练数据可以为存在像素缺失的图像数据或视频数据以及该存在像素缺失的图像数据或视频数据对应的完整图像数据或视频数据。在采集到训练数据之后,数据采集设备560将这些训练数据存入数据库530,训练设备520基于数据库530中维护的训练数据训练得到目标模型/规则501。
以目标模型/规则501用于实现自然语言合成任务为例,上述目标模型/规则501(例如本申请实施例中的目标编码器、目标预测网络)能够用于实现自然语言合成任务,即,将存在文本缺失的文本数据输入该目标模型/规则501,即可得到缺失的文本(例如本申请实施例中的第一待预测数据单元以及第二待预测数据单元)。
以目标模型/规则501用于实现目标处理任务(例如短文本分类、长文本分类、自然语言推断、文本相似度匹配、文本情感分类等)为例,上述目标模型/规则501(例如本申请实施例中的目标编码器、任务网络)能够用于实现目标处理任务,即,将目标数据输入该目标模型/规则501,即可得到任务处理结果。
需要说明的是,在实际应用中,数据库530中维护的训练数据不一定都来自于数据采集设备560的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备520也不一定完全基于数据库530维护的训练数据进行目标模型/规则501的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备520训练得到的目标模型/规则501可以应用于不同的系统或设备中，如应用于图3b所示的执行设备510，所述执行设备510可以是终端，如手机终端，平板电脑，笔记本电脑，增强现实(augmented reality，AR)/虚拟现实(virtual reality，VR)设备，车载终端等，还可以是服务器或者云端等。在图3b中，执行设备510配置输入/输出(input/output，I/O)接口512，用于与外部设备进行数据交互，用户可以通过客户设备540向I/O接口512输入数据(例如本申请实施例中的目标数据)。
预处理模块513和预处理模块514用于根据I/O接口512接收到的输入数据进行预处理(例如获取已知数据单元以及待预测数据单元在目标数据中的位置、或者生成注意力信息等预处理过程)。应理解,可以没有预处理模块513和预处理模块514或者只有的一个预处理模块。当不存在预处理模块513和预处理模块514时,可以直接采用计算模块511对输入数据进行处理。
在执行设备510对输入数据进行预处理，或者在执行设备510的计算模块511执行计算等相关的处理过程中，执行设备510可以调用数据存储系统550中的数据、代码等以用于相应的处理，也可以将相应处理得到的数据、指令等存入数据存储系统550中。
最后,I/O接口512将处理结果,如处理后得到的缺失文本、缺失音频序列、缺失像素(例如本申请实施例中的第一待预测数据单元、第二待预测数据单元、任务处理结果)呈现给客户设备540,从而提供给用户。
在图3b所示情况下,用户可以手动给定输入数据,该“手动给定输入数据”可以通过I/O接口512提供的界面进行操作。另一种情况下,客户设备540可以自动地向I/O接口512发送输入数据,如果要求客户设备540自动发送输入数据需要获得用户的授权,则用户可以在客户设备540中设置相应权限。用户可以在客户设备540查看执行设备510输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备540也可以作为数据采集端,采集如图所示输入I/O接口512的输入数据及输出I/O接口512的输出结果作为新的样本数据,并存入数据库530。当然,也可以不经过客户设备540进行采集,而是由I/O接口512直接将如图所示输入I/O接口512的输入数据及输出I/O接口512的输出结果,作为新的样本数据存入数据库530。
值得注意的是，图3b仅是本申请实施例提供的一种系统架构的示意图，图中所示设备、器件、模块等之间的位置关系不构成任何限制，例如，在图3b中，数据存储系统550相对执行设备510是外部存储器，在其它情况下，也可以将数据存储系统550置于执行设备510中。
应理解上述执行设备510也可以部署于客户设备540中。
从模型推理的角度，本申请实施例中，上述数据存储系统550中可以存储有用于实现本申请实施例中的数据处理方法相关的代码，计算模块511可以从数据存储系统550中获取到上述用于实现本申请实施例中的数据处理方法相关的代码，以执行本申请实施例中的数据处理方法。
本申请实施例中，计算模块511可以包括硬件电路(如专用集成电路(application specific integrated circuit，ASIC)、现场可编程门阵列(field-programmable gate array，FPGA)、通用处理器、数字信号处理器(digital signal processing，DSP)、微处理器或微控制器等等)、或这些硬件电路的组合，例如，计算模块511可以为具有执行指令功能的硬件系统，如CPU、DSP等，或者为不具有执行指令功能的硬件系统，如ASIC、FPGA等，或者为上述不具有执行指令功能的硬件系统以及具有执行指令功能的硬件系统的组合。
具体的，计算模块511可以为具有执行指令功能的硬件系统，本申请实施例提供的数据处理方法可以为存储在数据存储系统550中的软件代码，计算模块511可以从数据存储系统550中获取到软件代码，并执行获取到的软件代码来实现本申请实施例提供的数据处理方法。
应理解，计算模块511可以为不具有执行指令功能的硬件系统以及具有执行指令功能的硬件系统的组合，本申请实施例提供的数据处理方法的部分步骤还可以通过计算模块511中不具有执行指令功能的硬件系统或者预处理模块513、预处理模块514来实现，这里并不限定。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以xs(即输入数据)和截距1为输入的运算单元,该运算单元的输出可以为:
h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s·x_s + b)
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
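A minimal sketch of the neural-unit computation above, assuming a sigmoid activation as the function f (the text names sigmoid as one possible choice):

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps the weighted sum to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b):
    """Output of a neural unit: f(sum_s(W_s * x_s) + b), where xs are the
    inputs, ws the weights W_s, and b the bias of the unit."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return sigmoid(z)

print(neural_unit([1.0, 2.0], [0.5, -0.25], 0.0))  # sigmoid(0.0) = 0.5
```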
(2)transformer层
参照图5,图5为一种transformer层的架构示意,如图5所示,神经网络包括嵌入层和至少一个transformer层,至少一个transformer层可以为N个transformer层(N大于0的整数),其中,每个transformer层包括依次相邻的注意力层、加和与归一化(add&norm)层、前馈(feed forward)层和加和与归一化层。在嵌入层,对当前输入进行嵌入处理,得到多个嵌入向量;在所述注意力层,从所述第一transformer层的上一层获取P个输入向量,以P个输入向量中的任意的第一输入向量为中心,基于预设的注意力窗口范围内的各个输入向量与该第一输入向量之间的关联度,得到该第一输入向量对应的中间向量,如此确定出P个输入向量对应的P个中间向量;在所述池化层,将所述P个中间向量合并为Q个输出向量,其中transformer层中最后一个transformer层得到的多个输出向量用作所述当前输入的特征表示。
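The sub-layer sequence described above (attention, add & norm, feed-forward, add & norm) can be sketched as follows; this is a simplified illustration with pluggable attention and feed-forward functions, not the exact layer of FIG. 5:

```python
def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (the 'norm' step)."""
    m = sum(v) / len(v)
    var = sum((x - m) ** 2 for x in v) / len(v)
    return [(x - m) / (var + eps) ** 0.5 for x in v]

def transformer_layer(vectors, attention, feed_forward):
    """One transformer layer over a list of input vectors: an attention
    sub-layer and a feed-forward sub-layer, each followed by a residual
    addition and layer normalization (the add & norm steps)."""
    out = []
    for v in vectors:
        a = attention(v, vectors)                        # attention sub-layer
        v1 = layer_norm([x + y for x, y in zip(v, a)])   # add & norm
        f = feed_forward(v1)                             # feed-forward sub-layer
        out.append(layer_norm([x + y for x, y in zip(v1, f)]))  # add & norm
    return out

# Illustrative call with trivial (zero-output) sub-layers.
example = transformer_layer([[1.0, 2.0, 3.0]],
                            lambda v, vs: [0.0] * len(v),
                            lambda v: [0.0] * len(v))
print(example)
```

Each layer keeps one output vector per input vector, which is the property the embodiments rely on when stacking sub-transformer layers.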
(3)注意力机制(attention mechanism)
注意力机制模仿了生物观察行为的内部过程,即一种将内部经验和外部感觉对齐从而增加部分区域的观察精细度的机制,能够利用有限的注意力资源从大量信息中快速筛选出高价值信息。注意力机制可以快速提取稀疏数据的重要特征,因而被广泛用于自然语言处理任务,特别是机器翻译。而自注意力机制(self-attention mechanism)是注意力机制的改进,其减少了对外部信息的依赖,更擅长捕捉数据或特征的内部相关性。注意力机制的本质思想可以改写为如下公式:
$$\mathrm{Attention}(Query,Source)=\sum_{i=1}^{L_{x}}\mathrm{Similarity}(Query,Key_{i})\cdot Value_{i}$$
其中,Lx=||Source||代表Source的长度,公式含义即将Source中的构成元素想象成是由一系列的<Key,Value>数据对构成,此时给定目标Target中的某个元素Query,通过计算Query和各个Key的相似性或者相关性,得到每个Key对应Value的权重系数,然后对Value进行加权求和,即得到了最终的Attention数值。所以本质上Attention机制是对Source中元素的Value值进行加权求和,而Query和Key用来计算对应Value的权重系数。从概念上理解,可以把Attention理解为从大量信息中有选择地筛选出少量重要信息并聚焦到这些重要信息上,忽略大多数不重要的信息。聚焦的过程体现在权重系数的计算上,权重越大越聚焦于其对应的Value值上,即权重代表了信息的重要性,而Value是其对应的信息。自注意力机制可以理解为内部Attention(intra attention),Attention机制发生在Target的元素Query和Source中的所有元素之间,自注意力机制指的是在Source内部元素之间或者Target内部元素之间发生的Attention机制,也可以理解为Target=Source这种特殊情况下的注意力计算机制,其具体计算过程是一样的,只是计算对象发生了变化而已。
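上述"以Query和Key计算权重系数、再对Value加权求和"的本质思想,可以用如下极简的点积注意力示意(相似度采用点积、权重采用softmax均为常见假设,非本申请的具体实现):

```python
import numpy as np

def attention(query, keys, values):
    # 相似度 -> softmax 权重系数 -> 对 Value 加权求和
    scores = keys @ query                  # Query 与各个 Key 的点积相似度
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()      # softmax 得到每个 Value 的权重系数
    return weights @ values                # 加权求和即为最终的 Attention 数值

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])    # 两个 Key
V = np.array([[10.0, 0.0], [0.0, 10.0]])  # 各 Key 对应的 Value
out = attention(q, K, V)                  # 与第一个 Key 更相似,故更聚焦第一个 Value
```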
(4)自然语言处理(natural language processing,NLP)
自然语言(natural language)即人类语言,自然语言处理(NLP)就是对人类语言的处理。自然语言处理是以一种智能与高效的方式,对文本数据进行系统化分析、理解与信息提取的过程。通过使用NLP及其组件,我们可以管理非常大块的文本数据,或者执行大量的自动化任务,并且解决各式各样的问题,如自动摘要(automatic summarization),机器翻译(machine translation,MT),命名实体识别(named entity recognition,NER),关系提取(relation extraction,RE),信息抽取(information extraction,IE),情感分析,语音识别(speech recognition),问答系统(question answering)以及主题分割等等。
(5)预训练语言模型(pre-trained language model)
预训练语言模型是一个自然语言序列编码器,为自然语言序列中的每个词进行编码成为一个向量表示,从而进行预测任务。它的训练包含两个阶段。在预训练(pre-training)阶段,该模型在大规模无监督文本上进行语言模型任务的训练,从而学习到一个词表示。在微调(finetuning)阶段,该模型利用预训练阶段学到的参数做初始化,在文本分类(text classification),序列标注(sequence labeling)等下游任务(downstream task)上进行较少步骤的训练,就可以成功把预训练得到的语义信息成功迁移到下游任务上来。
(6)自回归语言模型(autoregressive language model)
自回归语言模型是指能够根据给定的上下文(如“手机很”)预测下一个可能跟随的词(如“不错”)的模型,该模型通常是给定左侧上文预测右侧下文中的词,但也可以是给定左侧和右侧的上下文预测中间的某个词。
首先以模型推理的阶段为例对本申请实施例提供的数据处理方法进行说明。
参照图6a,图6a为本申请实施例提供的一种数据处理方法的实施例示意,本申请实施例提供的一种数据处理方法可以应用在上述描述的数据处理设备、执行设备中,具体的,数据处理方法可以应用在手机、平板、笔记本电脑、智能穿戴设备等终端设备上,或者应用在云侧的服务器上,如图6a示出的那样,本申请实施例提供的一种数据处理方法,包括:
601、获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数。
其中,目标数据为存在数据缺失的数据,其中,目标数据包括未缺失的数据(本申请实施例中称之为已知数据单元)以及缺失的数据(本申请实施例中称之为待预测数据单元,例如第一待预测数据单元以及第二待预测数据单元)。其中,已知数据单元为未缺失的数据中的数据单元,例如所述目标数据可以为文本数据,则目标数据中的已知数据单元可以为文本数据中已知的单词或者是已知的字,待预测数据单元可以为所述文本数据中的待预测词或者是待预测字。例如所述目标数据可以为语音数据,则目标数据中的已知数据单元可以为语音数据中的已知音频序列,待预测数据单元可以为所述语音数据中的待预测音频序列。例如所述目标数据可以为图像数据,则目标数据中的已知数据单元可以为图像数据中的已知像素点,待预测数据单元可以为所述图像数据中的待预测像素点。应理解,已知数据单元和待预测数据单元的数据粒度和目标数据的类型有关,已知数据单元和待预测数据单元的数据粒度可以是目标数据中最小的数据单元或者是由最小的数据单元组成的多个数据单元,这里并不限定已知数据单元和待预测数据单元的粒度。
具体的,本申请实施例中,目标数据中可以包括M个已知数据单元,以及至少一个待预测数据单元(包括第一待预测数据单元),其中待预测数据单元为目标数据中不可见的数据,需要通过M个已知数据单元来确定待预测数据单元。
以目标数据为文本数据为例,本申请实施例中,文本数据中可以包括M个已知词,以 及至少一个待预测词(包括第一待预测词)。其中文本数据可以为中文文本,也可以为英文文本,还可以为其他语言文本,文本数据可以为句子、段落、篇章等。
例如,目标数据可以为“__sat on the mat”,其中“sat”、“on”、“the”和“mat”为已知数据单元,“_”和“_”在目标数据中不可见,为待预测数据单元;应理解,这里的符号“_”的含义是空,而不是指下划线。
本申请实施例中,可以获取M个第一嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置。
接下来首先描述如何生成M个第一嵌入向量:
在一种实现中,可以通过嵌入层对所述目标数据中的M个已知数据单元进行嵌入处理,以得到M个第三嵌入向量。
其中嵌入层可以称为输入嵌入(input embedding)层。当前输入可以为M个已知数据单元。嵌入层在获取当前输入后,可以对该当前输入中各个已知数据单元进行嵌入处理,可得到各个已知数据单元对应的嵌入向量(也就是第三嵌入向量)。
在一些实施例中,还可以获取所述M个已知数据单元中的每个已知数据单元的位置向量,所述位置向量用于指示所述第一位置;其中,第一位置用于表示已知数据单元在目标数据中的位置,具体的,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系。
在一种实现中,所述嵌入层可以包括输入嵌入层和位置编码(positional encoding)层。在输入嵌入层,可以对当前输入中的各个已知数据单元进行词嵌入处理,从而得到各个已知数据单元的第三嵌入向量。在位置编码层,可以获取各个已知数据单元在该当前输入中的位置,进而对各个已知数据单元的位置生成位置向量。
在一些示例中,各个已知数据单元在目标数据中的第一位置可以为各个已知数据单元在目标数据中的绝对位置。以当前输入为“几号应还花呗”为例,其中的“几”的位置可以表示为第一位,“号”的位置可以表示为第二位,……。在一些示例中,各个已知数据单元在目标数据中的第一位置可以为各个已知数据单元在目标数据中的相对位置。仍以当前输入为“几号应还花呗”为例,其中的“几”的位置可以表示为“号”之前,“号”的位置可以表示为“几”之后、“应”之前,……。当得到当前输入中各个已知数据单元的第三嵌入向量和位置向量时,可以将各个已知数据单元的位置向量和对应的第三嵌入向量进行融合,得到各个已知数据单元的第一嵌入向量,即得到该当前输入对应的多个第一嵌入向量。应理解,融合的方式可以是对第三嵌入向量和位置向量进行加法运算,或者是通过其他运算使得第一嵌入向量可以携带目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置的信息,这里并不限定具体的融合方式。多个第一嵌入向量可以表示为具有预设维度的嵌入矩阵。可以设定该多个第一嵌入向量的个数为M,预设维度为H维,则该多个第一嵌入向量可以表示为M×H的嵌入矩阵。
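上述"第三嵌入向量与位置向量相加融合,得到M×H嵌入矩阵"的过程可示意如下(词表大小、维度H以及随机初始化的嵌入表均为示意性假设):

```python
import numpy as np

rng = np.random.default_rng(0)
M, H, vocab = 4, 8, 100
token_emb = rng.normal(size=(vocab, H))  # 输入嵌入表:每个数据单元对应一个第三嵌入向量(假设)
pos_emb = rng.normal(size=(M, H))        # 位置向量表:每个位置对应一个位置向量(假设)

token_ids = np.array([3, 17, 42, 9])     # M 个已知数据单元的编号
# 加法融合:第三嵌入向量 + 位置向量 -> M 个第一嵌入向量,构成 M x H 的嵌入矩阵
first_embeddings = token_emb[token_ids] + pos_emb
```

这样每个第一嵌入向量同时携带了数据单元本身的信息及其在目标数据中的位置信息。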
本申请实施例中,可以获取第二嵌入向量,其中所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置,其中,第二位置可以用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
接下来描述如何生成第二嵌入向量:
在一种实现中,可以通过嵌入层对第一待预测数据单元在所述目标数据中的第二位置进行嵌入处理,以得到用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置的第二嵌入向量,该第二嵌入向量可以作为后续目标预测网络的输入。其中,所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系,关于第二位置的描述可以参照上述实施例中关于第一位置的描述,相似之处这里不再赘述。
进而,可以获取到针对于M个已知数据单元的M个第一嵌入向量,以及针对于第一待预测数据单元的第二嵌入向量。
602、通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的。
本申请实施例中,目标编码器可以对M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,也就是可以得到每个已知数据单元对应的一个第一输出向量。
在现有的自然语言生成模型(参照图6b)中,将自编码模型和自回归语言模型进行了融合,相比自编码模型和自回归语言模型,该模型将隐状态数量扩大了一倍,其中白色部分对应自编码模型,灰色部分对应自回归语言模型,自编码模型相关的隐变量用于表示位置信息,自回归模型用于提供自编码语言模型预测的上下文信息,该模型的计算量和内存消耗是自编码和自回归模型的两倍。
本申请实施例中,在目标编码器处理M个第一嵌入向量的过程中,隐状态数量和自编码和自回归语言模型中的隐状态数量一致,具体的,针对于M个已知数据单元对应的M个第一嵌入向量,目标编码器可以将M个第一嵌入向量作为输入,其中第一嵌入向量包括了位置信息和已知数据单元的数据信息,而不需要再单独设置额外的M个位置信息来作为目标编码器的输入,此外,目标编码器的中间输出的隐变量的数量也和输入的嵌入向量的数量保持一致,减少了目标编码器的计算量和内存消耗。
具体可以参照图6c,目标编码器的输入为M个第一嵌入向量,输出为M个第一输出向量。
本申请实施例中,每个第一输出向量为基于M个第一嵌入向量得到的。
所谓每个第一输出向量为基于M个第一嵌入向量得到的,可以理解为每个第一输出向量可以将M个第一嵌入向量作为参考,也就是在生成每个第一输出向量时,各个第一嵌入向量都是可见的,或者,每个第一输出向量与M个第一嵌入向量存在依赖关系。
在一种实现中,所述目标编码器可以为第一转换transformer层,则每个第一输出向量为基于M个第一嵌入向量得到可以理解为M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联。
其中,参照图7,所述第一transformer层可以包括串行的多个子transformer层,所述多个子transformer层包括相邻的第一子transformer层和第二子transformer层,也就是说第一子transformer层和第二子transformer层可以为第一transformer层中任意相邻的两个子transformer层。
可以通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
也就是说,每个子transformer层的输入包括与M个已知数据单元对应的M个特征向量,每个子transformer层的输出包括与M个已知数据单元对应的M个输出向量。进而使得目标编码器的中间输出的隐变量的数量也和输入的嵌入向量的数量保持一致,减少了目标编码器的计算量和内存消耗。
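"每个子transformer层输入M个向量、输出M个向量,并逐层串行传递"的数据流可示意如下(层内部以一个简化变换代替真实的注意力与前馈计算,仅示意隐变量数量逐层保持不变):

```python
import numpy as np

def sub_layer(x, w):
    # 示意:一个子层对 M 个向量做变换,输出仍是 M 个向量(形状不变)
    return np.tanh(x @ w)

M, H, num_layers = 5, 8, 3
rng = np.random.default_rng(1)
weights = [rng.normal(scale=0.1, size=(H, H)) for _ in range(num_layers)]

hidden = rng.normal(size=(M, H))  # 最靠近输入侧的子层输入:M 个第一嵌入向量
for w in weights:                 # 串行的多个子 transformer 层
    hidden = sub_layer(hidden, w) # 每层输出 M 个中间向量,传给相邻的下一个子层
first_outputs = hidden            # 最靠近输出侧的子层输出:M 个第一输出向量
```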
transformer层的核心特点在于其采用的独特的注意力机制。在处理自然语言,例如一个句子时,transformer模型利用该注意力机制,为句子中各个词的嵌入向量赋予不同的注意力系数,从而更全面地考虑句子中上下文对各个词的影响。具体的transformer层可以包括依次相邻的多头注意力层、加和与归一化(add&norm)层、前馈(feed forward)层、加和与归一化层。其中,注意力层与嵌入层相连,从嵌入层获取M个嵌入向量作为输入向量,基于M个嵌入向量中各个嵌入向量之间的关联度,对各个嵌入向量进行综合,得到M个输出向量,输出给后续的transformer层。transformer层获取前一层的输出作为输入向量,执行与前一级transformer层类似的操作。
参照图8,图8为一种transformer层的结构示意,本申请实施例中的各个子transformer层都可以参照图8中示出的结构,如图8中示出的那样,transformer层包括依次相邻的多头注意力层、加和与归一化(add&norm)层、前馈(feed forward)层、加和与归一化层。
其中,多头注意力层从其上一层获取M个输入向量X l,又可以表示为矩阵X,采用自注意力机制,基于向量间的关联度对各个向量进行变换,得到M个输出向量,又可以表示为矩阵Y。可以理解,当该多头注意力层是与嵌入层直接相连的层,例如图7中与嵌入层直连的transformer层,其获取的输入向量即为嵌入层输出的嵌入向量;当该多头注意力层是后续的transformer层包括的多头注意力层,例如图7中与上一级transformer层直连的transformer层包括的多头注意力层,其获取的输入向量即为前一级transformer层的输出向量。在多头注意力层,基于多头注意力(multi-head attention,MHA)的MHA层包括多个注意力头head(如图8中示出的Head 1、Head 2、…、Head N)。
图9为一个注意力头head的操作示意图,该示意图示出注意力头head如何将输入矩阵X变换为输出矩阵Y。如图9所示,分别采用第一变换矩阵Q,第二变换矩阵K和第三变换矩阵V对M个输入向量<X1,X2,…,XM>中各个输入向量Xi进行变换,得到各个输入向量对应的第一中间向量(q向量),第二中间向量(k向量)和第三中间向量(v向量)。操作上,可以分别用第一变换矩阵Q,第二变换矩阵K和第三变换矩阵V,对M个输入向量构成的输入矩阵X进行线性变换,分别得到输入矩阵的Q矩阵,K矩阵和V矩阵,再分别对矩阵进行拆分,即可得到各个输入向量对应的q向量,k向量和v向量。对于M个输入向量中任意的第i输入向量Xi,基于该第i输入向量对应的第一中间向量(q向量,qi)与各个输入向量Xj对应的各个第二中间向量(k向量,kj)的点乘操作,确定该第i输入向量Xi与各个输入向量Xj的各个关联度。尽管也可以直接将qi与kj的点乘结果确定为关联度,但是更经典地,先将点乘结果除以一常数,然后进行softmax运算,将运算结果作为输入向量Xi与Xj的关联度,即:
$$\alpha_{i,j}=\mathrm{softmax}\left(\frac{q_{i}\cdot k_{j}}{\sqrt{d_{k}}}\right)$$
于是,可以以该第i输入向量Xi与各个输入向量Xj的各个关联度αi,j作为权重因子,对各个输入向量Xj对应的第三中间向量(v向量,vj)进行加权组合,得到该第i输入向量Xi对应的第i组合向量Ci:
$$C_{i}=\sum_{j=1}^{M}\alpha_{i,j}\cdot v_{j}$$
于是,可以得到M个输入向量对应的M个组合向量的向量序列<C1,C2,…,CM>,或矩阵C。基于该组合向量序列,可以得到M个输出向量。具体地,在一个实施例中,可以直接将M个组合向量的向量序列作为M个输出向量,即Yi=Ci。此时,输出矩阵Y即为组合向量矩阵C,又可以写成:
$$Y=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
以上为一个注意力头head的处理过程描述,在MHA架构中,MHA层维护m套变换矩阵,每套变换矩阵包括前述第一变换矩阵Q、第二变换矩阵K和第三变换矩阵V,从而可以并行地进行上述操作,得到m个组合向量序列(即m个矩阵C),每个向量序列包括基于一套变换矩阵得到的M个组合向量。在这样的情况下,MHA层将得到的m个组合向量序列进行拼接,得到拼接矩阵;再通过第四变换矩阵W对该拼接矩阵进行变换,得到最终的输出矩阵Y。将该输出矩阵Y拆分即对应于M个输出向量<Y1,Y2,…,YM>。通过以上的操作过程,MHA层基于M个输入向量之间的关联度进行变换操作,得到M个输出向量。
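一个注意力头的上述计算流程(q/k/v变换、点乘后除以常数、softmax得到关联度、加权组合得到输出)可按公式逐步示意如下(维度取值与随机权重均为示意性假设):

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    d = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # 三个变换矩阵得到 q/k/v 向量
    scores = Q @ K.T / np.sqrt(d)          # 点乘结果除以常数 sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)     # softmax 得到关联度 alpha_{i,j}
    return A @ V                           # 以关联度为权重组合 v 向量,得到输出矩阵 Y

M, H = 4, 6
rng = np.random.default_rng(2)
X = rng.normal(size=(M, H))                # M 个输入向量构成的输入矩阵
Y = attention_head(X, *(rng.normal(size=(H, H)) for _ in range(3)))
```

MHA层即并行运行m个这样的head,再拼接各head的输出并做一次线性变换。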
如图8中示出的那样,transformer层可以包括前馈层,其中前馈层包括输入层、中间层intermediate layer以及输出层。如前所述,神经网络模型可以包含多个transformer层。在一个实施例中,上述多个transformer层可以采用残差网络的方式堆叠连接。
本申请实施例中,所述目标编码器包括注意力头,且由于在目标数据中,各个已知数据单元之间是相互可见的,因此在处理M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联,具体的,可以获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联,进而可以根据所述注意力信息,通过所述目 标编码器,对所述M个第一嵌入向量进行处理,进而使得每个输出向量与M个第一嵌入向量存在依赖关系。
603、通过目标预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到所述第一待预测数据单元。
本申请实施例中,在得到M个输出向量后,可以将M个输出向量输入到目标预测网络,并通过目标预测网络,对M个第一输出向量以及所述第二嵌入向量进行处理,以得到所述第一待预测数据单元。其中,所述目标预测网络可以为transformer层。
目标预测网络可以将M个所述第一输出向量以及所述第二嵌入向量作为输入,得到第一待预测数据单元的向量表示,应理解,第一待预测数据单元的向量表示可以再经过一个分类器(例如可采用支持向量机,softmax分类器,K-近邻算法等)来恢复出第一待预测数据单元。
以文本数据为例,在目标预测网络的数据处理过程中,第一待预测词可见自己对应的位置向量(第二嵌入向量)以及各个已知词(第一嵌入向量),进而目标预测网络可以将M个所述第一输出向量以及所述第二嵌入向量作为输入,得到第一待预测词的词向量表示。
示例性的,已知目标数据中位置3到位置6的词为“sat on the mat”,目标是预测句子中的前面两个词,目标预测网络可以首先基于“sat on the mat”对应的4个输入向量以及预测位置1来确定第一待预测词为“that”。相似的,目标预测网络接下来基于“that_sat on the mat”来预测位置2的词。
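上述"先预测位置1,再将预测结果并入上下文去预测位置2"的逐位回填流程可示意如下(predict_fn代表目标预测网络的预测接口,此处给出的是一个仅作演示的假设实现):

```python
def fill_missing(tokens, missing_positions, predict_fn):
    # tokens: 含 None(待预测数据单元)的序列;按给定顺序逐个预测并回填
    tokens = list(tokens)
    for pos in missing_positions:
        # 每一步都基于当前已知上下文(含先前已预测的词)预测该位置
        tokens[pos] = predict_fn(tokens, pos)
    return tokens

# 玩具 predict_fn:按位置查表,真实场景由目标预测网络给出
demo = {0: "the", 1: "cat"}
result = fill_missing([None, None, "sat", "on", "the", "mat"],
                      [0, 1], lambda t, p: demo[p])
```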
本申请实施例中,所述目标数据还包括第二待预测数据单元;在所述通过目标编码器,对所述M个第一嵌入向量进行处理之前,可以随机确定所述第一待预测数据单元与所述第二待预测数据单元的被预测先后顺序,若所述预测先后顺序用于指示所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则在所述得到所述第一待预测数据单元之后,可以获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述目标数据中的第二待预测数据单元在所述目标数据中的第三位置,通过所述目标编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到M个已知数据单元以及所述第一待预测数据单元对应的M+1个第二输出向量,通过所述目标预测网络,对M+1个第二输出向量以及所述第五嵌入向量进行处理,以得到所述第二待预测数据单元。
其中,所述每个已知数据单元对应的第二输出向量为根据所述M个第一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
接下来以文本数据为例,结合一个具体的示例对本申请实施例中的数据处理方法进行描述:
参照图10,预处理模块可以对输入的词向量序列进行处理,并将处理结果输入到自回归词向量编码模块以及查询模块,自回归词向量编码模块以及查询模块的输出结果可以输入到预测模块,预测模块可以输出预测标签。
其中,自回归词向量编码模块可以为上述实施例中的目标编码器,查询模块用于生成第二嵌入向量以及第五嵌入向量,预测模块可以为上述实施例中的目标预测网络,预测标签可以为上述实施例中的待预测数据单元。
参照图11,预处理模块可以对输入的词向量序列进行序列重排、分块以及信息抽取的操作。序列重排即对输入的词向量的顺序进行重构,重构方法包括但不限于:保持原来顺序不变,随机顺序以及倒序,序列分块将重排后的句子分成两块,在后续建模中,第一块的每个词的信息对所有的词都可见,第二块的每个词的信息只对其重排后其后面位置的词可见。信息抽取模块在序列重排的基础上,抽取三部分信息,分别为重排的词向量序列,注意力矩阵(Attention Matrix)(该矩阵定义了在自回归词向量编码模块对每个词的向量表示的建模过程中,句子中哪些词的信息是可见的,该矩阵根据分块得到,注意力矩阵可以为上述实施例中的注意力信息)以及待预测标签的辅助信息(也就是上述实施例中的第二嵌入向量或者第五嵌入向量),辅助信息定义了待预测词在原句中的位置信息。前面两部分信息会输出到自回归词向量编码模块,第三部分会输出到查询模块。应理解,预处理模块的操作由人工定义,不包含任何可学习部分。
自回归词向量编码模块可以学习每个词所对应的上下文信息,最终为句子中的每一个词学习得到一个包含其上下文信息的词向量序列(也就是上述实施例中的输出向量)。
自回归词向量编码模块可以如图12左图所示,该模块包括若干层自回归词向量编码器(其中每一层可以为上述实施例中的子transformer层),每一层接收前一层输出的词向量,并且经过计算词向量间的依赖,将每个词向量的上下文信息融入输出的词向量。图12右图展示了自回归词向量编码器的第i层的计算过程,图中每个方框表示一个词向量,下面一行表示该层输入的词向量,上面一行表示该层输出的词向量,每个箭头ai表示在每个输出词向量对于输入词向量的依赖,该依赖是否存在可以由注意力矩阵决定。
参照图13,在得到包含其上下文信息的词向量序列以及查询信息之后,可以将包含其上下文信息的词向量序列以及查询信息输入到预测模块,由预测模块基于包含其上下文信息的词向量序列以及查询信息进行预测,得到预测标签。
图14展示了一个关于随机预测顺序的实施例。原句为“the cat sat on the mat”,预处理模块将原句进行随机重排,重排后的序列为“the on mat sat the cat”,并且将重排后的句子分块,第一块为“the on mat sat”,该块内的任意一个词都可以被句子中其他所有词看到,第二块为“the cat”,第二块内的任意一个词只能对其后面的词可见(在重排的序列中),第二块的词在本例中将被预测,且由于原句被随机重排,则第二块的词的被预测顺序是随机的。该模块根据句子分块得到注意力矩阵,该矩阵的第i行第j列的元素若为1(白色),表示在后续建模过程中,重排序列中第j个词对第i个词是可见的,反之是不可见的。该模块输出重排的词向量序列(M个第一嵌入向量)以及注意力矩阵(即上述实施例中的注意力信息)输出给自回归词向量编码模块(即上述实施例中的目标编码器),将待预测标签的辅助信息(即上述实施例中的第二嵌入向量指示的第二位置以及第五嵌入向量指示的第三位置,在图14的示例中辅助信息为位置1和位置2,表示该模型将预测原序列中位置1和位置2的词,分别为the和cat)输出到查询模块,查询模块可以生成第二嵌入向量以及第 五嵌入向量。
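预处理模块中"根据句子分块得到注意力矩阵"的可见性规则(第一块对所有词可见,第二块内的词只对重排后其后面位置的词可见)可示意如下(第二块的词是否对自身可见在此按"不可见"处理,属于示意性假设):

```python
import numpy as np

def build_attention_matrix(seq_len, block1_len):
    # A[i, j] = 1 表示重排序列中第 j 个词对第 i 个词可见
    A = np.zeros((seq_len, seq_len), dtype=int)
    for i in range(seq_len):
        for j in range(seq_len):
            if j < block1_len:   # 第一块的词:对句子中所有词可见
                A[i, j] = 1
            elif j < i:          # 第二块的词:只对重排后其后面位置的词可见
                A[i, j] = 1
    return A

# 对应 "the on mat sat | the cat":前 4 个词为第一块,后 2 个词为第二块
A = build_attention_matrix(6, 4)
```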
本申请实施例中采用随机顺序的方式来进行预测,充分利用了待预测数据单元的顺序信息,将顺序信息显式的融入输出向量中。
应理解,以上以文本数据为例描述了待预测词的预测方法,本申请实施例的数据处理方法还可以应用到计算机视觉或者语音领域。具体的,上述目标文本可以替换为一个图片或者语音的序列,该序列相应的经过预处理模块的乱序,分块等操作,获得重排的图片或者语音单元的向量序列以及待预测位置的位置信息,并进入自回归编码模块和查询模块,最终预测模块得到相应的待预测位置的图片或者语音单元。
本申请实施例还可以呈现为云侧的一种服务或者软件的形式,参照图14,该服务或者软件可以具有基于目标数据中的已知数据单元来得到待预测数据单元的功能,具体的,该服务包括且不局限于用于进行文本数据(句子、段落、篇章等)中任意位置内容的预测恢复、语音数据中模糊语音或者缺失音频序列的恢复、图片/视频数据中模糊/破损像素的恢复等。
本申请实施例提供了一种数据处理方法,所述方法包括:获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;通过目标预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到所述第一待预测数据单元。通过上述方式,针对于M个已知数据单元对应的M个第一嵌入向量,目标编码器可以将M个第一嵌入向量作为输入,其中第一嵌入向量包括了位置信息和已知数据单元的数据信息,而不需要再单独设置额外的M个位置信息来作为目标编码器的输入,此外,目标编码器的中间输出的隐变量的数量也和输入的嵌入向量的数量保持一致,减少了目标编码器的计算量和内存消耗。
以上描述了模型的推理过程,接下来从模型训练的角度,对本申请实施例提供的数据处理方法进行描述,参照图15和图16,图15和图16为本申请实施例提供的一种数据处理方法的流程示意,如图15所示,本申请实施例提供的数据处理方法包括:
1501、获取第一编码器、第一预测网络、M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;
本申请实施例中,第一编码器和第一预测网络为待训练的神经网络模型。
在一种可能的实现中,所述目标数据为文本数据,所述已知数据单元为所述文本数据中的已知词,所述第一待预测数据单元为所述文本数据中的待预测词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的已知音频序列,所 述第一待预测数据单元为所述语音数据中的待预测音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的已知像素点,所述第一待预测数据单元为所述图像数据中的待预测像素点。
在一种可能的实现中,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系;所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
在一种可能的实现中,所述第一编码器为第一转换transformer层,所述第一预测网络为第二transformer层。
更多关于步骤1501的描述可以参照步骤601的描述,这里不再赘述。
1502、通过所述第一编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;可以通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
更多关于步骤1502的描述可以参照步骤602的描述,相似之处这里不再赘述。
1503、通过所述第一预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到第三预测数据单元;
其中,第三预测数据单元为第一预测网络进行预测的结果。
更多关于步骤1503的描述可以参照步骤603的描述,相似之处这里不再赘述。
1504、基于所述第三预测数据单元与所述第一待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。
其中,第三预测数据单元为第一预测网络进行预测的结果,因此需要基于第三预测数据单元和第一待预测数据单元之间的差异来构建损失,并基于构建的损失来更新第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。应理解,还可以基于上述损失去更新嵌入层等其他网络结构,这里并不限定。
在一种可能的实现中,所述目标数据还包括第二待预测数据单元;可以在所述通过所述第一编码器,对所述M个第一嵌入向量进行处理,以得到每个已知数据单元对应的第一输出向量之前,随机确定所述第一待预测数据单元与所述第二待预测数据单元的被预测先后顺序;若所述预测先后顺序用于指示所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则在所述得到第三预测数据单元之后,获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述目标数据中的第二待预测数据单 元在所述目标数据中的第三位置;通过所述第一编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到每个已知数据单元以及所述第一待预测数据单元对应的第二输出向量;通过所述第一预测网络,对M+1个第二输出向量以及所述第五嵌入向量进行处理,以得到所述第四待预测数据单元;进而可以基于所述第三预测数据单元与所述第一待预测数据单元之间的差异、以及所述第四预测数据单元与所述第二待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。
在一种可能的实现中,所述每个已知数据单元对应的第二输出向量为根据所述M个第一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
以目标数据为文本数据为例,训练阶段的参数调优可以采用深度学习中标准的反向传播算法(back propagation)进行。该阶段的损失函数可以为:
$$L(\theta_{1})=\log P(y|x;\theta_{1})=\sum_{i\in S}\log p(y_{i}|x;\theta_{1})$$
其中θ1是模型的所有参数(包括transformer参数,位置向量参数以及分类器参数),x是整个输入的序列,包含若干元素,y表示所有需要预测的词(即每个待预测位置对应的原词)组成的序列,S表示y中所有词的位置的合集,yi表示第i个位置需要预测的词。
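该损失函数即对所有待预测位置上目标词的对数概率求和,可示意如下(其中的概率取值为假设的示例输入,训练时通常最大化该量,等价于最小化其相反数):

```python
import math

def loss(probs_at_targets):
    # L = sum_{i in S} log p(y_i | x; theta):
    # 各待预测位置上,模型赋予正确目标词的概率的对数之和
    return sum(math.log(p) for p in probs_at_targets)

# 两个待预测位置,模型给正确词的概率分别为 0.8 与 0.5(假设)
L = loss([0.8, 0.5])
```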
参照图17,本申请实施例还提供了一种数据处理方法,所述方法包括:
1701、获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个数据单元以及所述一个数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于指示目标处理任务;所述M为正整数;
和上述图6a对应的实施例不同的是,本实施例中的第二嵌入向量用于指示目标处理任务,其中,目标处理任务包括但不限于:短文本分类,长文本分类,自然语言推断,文本相似度匹配,情感分类等等。
在一种可能的实现中,所述第一位置用于指示所述数据单元与其他数据单元之间的相对位置关系。
在一种可能的实现中,所述目标数据为文本数据,所述数据单元为所述文本数据中的词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的像素点。
更多关于步骤1701的具体描述可以参照上述实施例中步骤601的描述,这里不再赘述。
1702、通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个数据单元对应的M个输出向量,其中,每个所述数据单元对应的输出向量为根据所述M个第一嵌入向量生成的。
本申请实施例中的目标编码器可以以图6a对应的实施例中的目标编码器作为预训练模型进行针对于目标处理任务的模型微调得到。
在一种可能的实现中,所述目标编码器为第一转换transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;可以通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输 出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个输出向量。
在一种可能的实现中,所述目标编码器包括注意力头,可以获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
更多关于步骤1702的具体描述可以参照上述实施例中步骤602的描述,这里不再赘述。
1703、通过任务网络,对所述M个输出向量以及所述第二嵌入向量进行所述目标处理任务对应的处理,以得到任务处理结果。
在一种可能的实现中,所述预测网络为第二transformer层。
接下来以目标处理任务为多任务文本分类任务和阅读理解为例,对本申请实施例提供的数据处理方法进行说明。
参照图18,图18为一个关于文本情感分类的实施例示例,在本例中,目标数据为“the cat sat on the mat”,根据句子分块得到注意力矩阵,该矩阵的第i行第j列的元素若为1(白色),表示在后续建模过程中,重排序列中第j个词对第i个词是可见的,反之是不可见的。该模块输出重排的词向量序列以及注意力矩阵(根据分块得到)输出给自回归词向量编码模块,将待预测标签的辅助信息(任务类型,例如:情感分类)输出到查询模块。
自回归模块可以采用transformer层作为自回归词向量编码器,该模块将重排的词向量序列中每个词向量与其对应的位置向量(每个位置对应一个位置向量,为模型的参数一部分)相加,并且在建模过程中采用预处理模块提供的注意力矩阵,该矩阵定义了在transformer建模词表示的过程中,每个词对其他词是否可见,图18中的实线表示可见,transformer最终为每一个词得到一个融入上下文信息的词向量表示,并且输出到预测模块。查询模块输出任务类型对应的任务向量并输出给预测模块。预测模块依旧采用transformer模型,该模型建模该句子的向量表示。最终建模的每个词向量经过一个分类器,用于预测对应的词。
在精调阶段的训练中,该模型预测句子对应的标签。精调阶段的参数调优可以采用深度学习中标准的反向传播算法(back propagation)进行。该阶段的损失函数可以为:
$$L(\theta_{2})=\log P(y|x;\theta_{2})$$
其中θ2是模型的所有参数(包括Transformer参数,位置向量参数,任务编码参数以及分类器参数),x是整个输入的序列,包含若干元素,y表示句子对应的标签。
图19为一个关于片段(span)抽取的阅读理解的实施例示例,在该阅读理解任务中,给定一个问句(question)“who sat on the mat?”以及一个篇章(paragraph)“the cat sat on the mat”,该任务是找到篇章中答案的片段(span),即在篇章里面的起始(START)和结束(END)的位置(即“the”和“cat”)。在本例中,根据句子分块得到注意力矩阵,该矩阵的第i行第j列的元素若为1(白色),表示在后续建模过程中,重排序列中第j个词对第i个词是可见的,反之是不可见的。该模块输出重排的词向量序列以及注意力矩阵(根据 分块得到)输出给自回归词向量编码模块,将待预测标签的辅助信息(为篇章中每个词的位置信息)输出到查询模块。
自回归模块可以采用transformer作为自回归词向量编码器,该模块将重排的词向量序列中每个词向量与其对应的位置向量(每个位置对应一个位置向量,为模型的参数一部分)相加,并且在建模过程中采用预处理模块提供的注意力矩阵,该矩阵定义了在Transformer建模词表示的过程中,每个词对其他词是否可见,图中的实线表示可见,Transformer最终为每一个词得到一个融入上下文信息的词向量表示,并且输出到预测模块。查询模块输出任务类型对应的任务向量并输出给预测模块。预测模块依旧采用Transformer模型,该模型建模该句子的向量表示。最终建模的每个词向量经过两个分类器(分别输出每个词是否为START和END的概率,如图19中表格所示),用于预测对应的START和END的位置。
在精调阶段的训练中,该模型预测篇章中每个词对应的START和END的概率。精调阶段的参数调优采用深度学习中标准的反向传播算法进行。该阶段的损失函数可以为:
$$L(\theta_{3})=\log P(y_{\mathrm{START}}|x;\theta_{3})+\log P(y_{\mathrm{END}}|x;\theta_{3})$$
其中θ3是模型的所有参数(包括Transformer参数,位置向量参数,任务编码参数以及分类器参数),x是整个输入的序列,包含若干元素,P(ySTART|x;θ3)表示该模型将答案中START位置的词预测为START的概率,P(yEND|x;θ3)表示该模型将答案中END位置的词预测为END的概率。
在推理阶段,经过精调的模型可用作该下游任务的预测,以文本分类任务和阅读理解为例,该模型的预测方式与精调阶段相同,经过四个模块以及分类器得到句子或者词的标签。在阅读理解任务中,模型会取START分类器预测概率最大的词作为span的起始位置的词,然后会取起始位置之后END概率最大的词作为span的结束位置的词。
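阅读理解推理阶段"先取START概率最大的位置,再在其后取END概率最大的位置"的取片段规则可示意如下(此处允许END与START重合以覆盖单词答案,且概率数值均为假设示例):

```python
def extract_span(start_probs, end_probs):
    # 先取 START 概率最大的位置作为起始
    start = max(range(len(start_probs)), key=lambda i: start_probs[i])
    # 再在起始位置及其之后取 END 概率最大的位置作为结束
    end = max(range(start, len(end_probs)), key=lambda i: end_probs[i])
    return start, end

# 对应篇章 "the cat sat on the mat" 中答案片段 "the cat" 的示意概率
span = extract_span([0.7, 0.2, 0.02, 0.02, 0.04, 0.02],
                    [0.1, 0.6, 0.1, 0.05, 0.05, 0.1])
```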
在图1至图19所对应的实施例的基础上,为了更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关设备。具体参阅图20,图20为本申请实施例提供的数据处理装置2000的一种结构示意图,数据处理装置2000可以是终端设备或服务器,数据处理装置2000包括:
获取模块2001,用于获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;
关于获取模块2001的具体描述可以参照上述实施例中步骤601的描述,这里不再赘述。
编码模块2002,用于通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;
关于编码模块2002的具体描述可以参照上述实施例中步骤602的描述,这里不再赘述。
预测模块2003,用于通过目标预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到所述第一待预测数据单元。
关于预测模块2003的具体描述可以参照上述实施例中步骤603的描述,这里不再赘述。
在一种可能的实现中,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系;所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
在一种可能的实现中,所述目标编码器为第一转换transformer层,所述目标预测网络为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
在一种可能的实现中,所述目标编码器包括注意力头,所述编码模块,用于获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;
根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
在一种可能的实现中,所述装置还包括:
嵌入模块,用于通过嵌入层对所述目标数据中的M个已知数据单元进行嵌入处理,以得到M个第三嵌入向量;
获取所述M个已知数据单元中的每个已知数据单元的位置向量,所述位置向量用于指示所述第一位置;
将所述M个第三嵌入向量中的每个第三嵌入向量与对应的位置向量进行融合,以得到所述M个第一嵌入向量。
在一种可能的实现中,所述目标数据还包括第二待预测数据单元,且所述第二待预测数据单元与所述第一待预测数据单元的被预测先后顺序为随机确定的。
在一种可能的实现中,若所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则所述方法还包括:
获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述第二待预测数据单元在所述目标数据中的第三位置;
通过所述目标编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到M个已知数据单元以及所述第一待预测数据单元对应的M+1个第二输出向量;
通过所述目标预测网络,对所述M+1个第二输出向量以及所述第五嵌入向量进行处理,以得到所述第二待预测数据单元。
在一种可能的实现中,所述每个已知数据单元对应的第二输出向量为根据所述M个第 一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
在一种可能的实现中,所述目标数据为文本数据,所述已知数据单元为所述文本数据中的已知词,所述第一待预测数据单元为所述文本数据中的待预测词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的已知音频序列,所述第一待预测数据单元为所述语音数据中的待预测音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的已知像素点,所述第一待预测数据单元为所述图像数据中的待预测像素点。
参阅图21,图21为本申请实施例提供的数据处理装置2100的一种结构示意图,数据处理装置2100可以是终端设备或服务器,数据处理装置2100包括:
获取模块2101,用于获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个数据单元以及所述一个数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于指示目标处理任务;所述M为正整数;
关于获取模块2101的具体描述可以参照上述实施例中步骤1701的描述,这里不再赘述。
编码模块2102,用于通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个数据单元对应的M个输出向量,其中,每个所述数据单元对应的输出向量为根据所述M个第一嵌入向量生成的;
关于编码模块2102的具体描述可以参照上述实施例中步骤1702的描述,这里不再赘述。
任务处理模块2103,用于通过任务网络,对所述M个输出向量以及所述第二嵌入向量进行所述目标处理任务对应的处理,以得到任务处理结果。
关于任务处理模块2103的具体描述可以参照上述实施例中步骤1703的描述,这里不再赘述。
在一种可能的实现中,所述第一位置用于指示所述数据单元与其他数据单元之间的相对位置关系。
在一种可能的实现中,所述目标编码器为第一转换transformer层,所述任务网络为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的 transformer层,则所述子transformer层输出的数据为所述M个输出向量。
在一种可能的实现中,所述目标编码器包括注意力头,所述编码模块,用于获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;
根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
在一种可能的实现中,所述目标数据为文本数据,所述数据单元为所述文本数据中的词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的像素点。
在一种可能的实现中,所述目标处理任务包括短文本分类、长文本分类、自然语言推断、文本相似度匹配或文本情感分类。
参阅图22,图22为本申请实施例提供的数据处理装置2200的一种结构示意图,数据处理装置2200可以是终端设备或服务器,数据处理装置2200包括:
获取模块2201,用于获取第一编码器、第一预测网络、M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;
关于获取模块2201的具体描述可以参照上述实施例中步骤1501的描述,这里不再赘述。
编码模块2202,用于通过所述第一编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;
关于编码模块2202的具体描述可以参照上述实施例中步骤1502的描述,这里不再赘述。
预测模块2203,用于通过所述第一预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到第三预测数据单元;
关于预测模块2203的具体描述可以参照上述实施例中步骤1503的描述,这里不再赘述。
模型训练模块2204,用于基于所述第三预测数据单元与所述第一待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。
关于模型训练模块2204的具体描述可以参照上述实施例中步骤1504的描述,这里不再赘述。
在一种可能的实现中,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系;所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
在一种可能的实现中,所述第一编码器为第一转换transformer层,所述第一预测网络 为第二transformer层。
在一种可能的实现中,所述第一transformer层包括串行的多个子transformer层;所述通过所述第一编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
在一种可能的实现中,所述目标数据还包括第二待预测数据单元,且所述第二待预测数据单元与所述第一待预测数据单元的被预测先后顺序为随机确定的。
在一种可能的实现中,若所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则所述方法还包括:
获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述目标数据中的第二待预测数据单元在所述目标数据中的第三位置;
通过所述第一编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到M个已知数据单元以及所述第一待预测数据单元对应的M+1个第二输出向量;
通过所述第一预测网络,对所述M+1个第二输出向量以及所述第五嵌入向量进行处理,以得到第四预测数据单元;
所述基于所述第三预测数据单元与所述第一待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络,包括:
基于所述第三预测数据单元与所述第一待预测数据单元之间的差异、以及所述第四预测数据单元与所述第二待预测数据单元之间的差异,更新所述第一编码器和所述第一预测网络,以得到目标编码器和目标预测网络。
在一种可能的实现中,所述每个已知数据单元对应的第二输出向量为根据所述M个第一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
在一种可能的实现中,所述目标数据为文本数据,所述已知数据单元为所述文本数据中的已知词,所述第一待预测数据单元为所述文本数据中的待预测词;或者,
所述目标数据为语音数据,所述已知数据单元为所述语音数据中的已知音频序列,所述第一待预测数据单元为所述语音数据中的待预测音频序列;或者,
所述目标数据为图像数据,所述已知数据单元为所述图像数据中的已知像素点,所述第一待预测数据单元为所述图像数据中的待预测像素点。
接下来介绍本申请实施例提供的一种执行设备,请参阅图23,图23为本申请实施例 提供的执行设备的一种结构示意图,执行设备2300具体可以表现为虚拟现实VR设备、手机、平板、笔记本电脑、智能穿戴设备、监控数据处理设备或服务器等,此处不做限定。具体的,执行设备2300包括:接收器2301、发射器2302、处理器2303和存储器2304(其中执行设备2300中的处理器2303的数量可以一个或多个,图23中以一个处理器为例),其中,处理器2303可以包括应用处理器23031和通信处理器23032。在本申请的一些实施例中,接收器2301、发射器2302、处理器2303和存储器2304可通过总线或其它方式连接。
存储器2304可以包括只读存储器和随机存取存储器,并向处理器2303提供指令和数据。存储器2304的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器2304存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器2303控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器2303中,或者由处理器2303实现。处理器2303可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器2303中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器2303可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器2303可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器2304,处理器2303读取存储器2304中的信息,结合其硬件完成上述方法的步骤。
接收器2301可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器2302可用于通过第一接口输出数字或字符信息;发射器2302还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器2302还可以包括显示屏等显示设备。
本申请实施例中,在一种情况下,处理器2303,用于执行图6a、图17对应实施例描述的数据处理方法。
本申请实施例还提供了一种训练设备,请参阅图24,图24是本申请实施例提供的训练设备一种结构示意图,具体的,训练设备2400由一个或多个服务器实现,训练设备2400可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)2424(例如,一个或一个以上处理器)和存储器2432,一个或一个以上存储应用程序2442或数据2444的存储介质2430(例如一个或一个以上海量存储设备)。其中,存储器2432和存储介质2430可以是短暂存储或持久存储。存储在存储介质2430的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对训练设备中的一系列指令操作。更进一步地,中央处理器2424可以设置为与存储介质2430通信,在训练设备2400上执行存储介质2430中的一系列指令操作。
训练设备2400还可以包括一个或一个以上电源2426,一个或一个以上有线或无线网络接口2450,一个或一个以上输入输出接口2458;或,一个或一个以上操作系统2441,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
本申请实施例中,中央处理器2424,用于执行图15对应实施例中描述的数据处理方法。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的数据处理方法,或者,以使训练设备内的芯片执行上述实施例描述的数据处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图25,图25为本申请实施例提供的芯片的一种结构示意图,上述图6a、图15和图17对应的实施例中描述的数据处理方法可以在图25所示的芯片中实现。具体的,所述芯片可以表现为神经网络处理器NPU 2500,NPU 2500作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路2503,控制器2504控制运算电路2503提取存储器(权重存储器或输入存储器)中的数据并进行运算。
上述图6a、图15和图17对应的实施例中描述的数据处理方法可以由图25所示的芯片中的主CPU和NPU共同配合完成。
在一些实现中,运算电路2503内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路2503是二维脉动阵列。运算电路2503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路2503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器2502中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器2501中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)2508中。
统一存储器2506用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)2505被搬运到权重存储器2502中。输入数据也通过DMAC被搬运到统一存储器2506中。
BIU即总线接口单元(Bus Interface Unit)2510,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)2509的交互。
总线接口单元2510(Bus Interface Unit,简称BIU),用于取指存储器2509从外部存储器获取指令,还用于存储单元访问控制器2505从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器2506或将权重数据搬运到权重存储器2502中或将输入数据搬运到输入存储器2501中。
向量计算单元2507包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元2507能将经处理的输出的向量存储到统一存储器2506。例如,向量计算单元2507可以将线性函数或非线性函数应用到运算电路2503的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元2507生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路2503的激活输入,例如用于在神经网络中的后续层中的使用。
控制器2504连接的取指存储器(instruction fetch buffer)2509,用于存储控制器2504使用的指令;
统一存储器2506,输入存储器2501,权重存储器2502以及取指存储器2509均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软 件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (28)

  1. 一种数据处理方法,其特征在于,所述方法包括:
    获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个已知数据单元以及所述一个已知数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于表示所述目标数据中的第一待预测数据单元在所述目标数据中的第二位置;所述M为正整数;
    通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,其中,每个所述已知数据单元对应的第一输出向量为根据所述M个第一嵌入向量生成的;
    通过目标预测网络,对所述M个第一输出向量以及所述第二嵌入向量进行处理,以得到所述第一待预测数据单元。
  2. 根据权利要求1所述的方法,其特征在于,所述第一位置用于指示所述已知数据单元与其他已知数据单元以及所述已知数据单元与所述第一待预测数据单元之间的相对位置关系;所述第二位置用于指示所述第一待预测数据单元与所述目标数据中各个已知数据单元之间的相对位置关系。
  3. 根据权利要求1或2所述的方法,其特征在于,所述目标编码器为第一转换transformer层,所述目标预测网络为第二transformer层。
  4. 根据权利要求3所述的方法,其特征在于,所述第一transformer层包括串行的多个子transformer层;所述通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
    通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个第一输出向量。
  5. 根据权利要求1至4任一所述的方法,其特征在于,所述目标编码器包括注意力头,所述通过目标编码器,对所述M个第一嵌入向量进行处理,包括:
    获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;
    根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
  6. 根据权利要求1至5任一所述的方法,其特征在于,所述方法还包括:
    通过嵌入层对所述目标数据中的M个已知数据单元进行嵌入处理,以得到M个第三嵌入向量;
    获取所述M个已知数据单元中的每个已知数据单元的位置向量,所述位置向量用于指示所述第一位置;
    将所述M个第三嵌入向量中的每个第三嵌入向量与对应的位置向量进行融合,以得到所述M个第一嵌入向量。
  7. 根据权利要求1至6任一所述的方法,其特征在于,所述目标数据还包括第二待预测数据单元,且所述第二待预测数据单元与所述第一待预测数据单元的被预测先后顺序为随机确定的。
  8. 根据权利要求7所述的方法,其特征在于,若所述第二待预测数据单元在所述第一待预测数据单元之后被预测,则所述方法还包括:
    获取第四嵌入向量和第五嵌入向量,所述第四嵌入向量用于表示所述第一待预测数据单元以及所述第一待预测数据单元在所述目标数据中的第二位置,所述第五嵌入向量用于表示所述第二待预测数据单元在所述目标数据中的第三位置;
    通过所述目标编码器,对所述M个第一嵌入向量以及所述第四嵌入向量进行处理,以得到M个已知数据单元以及所述第一待预测数据单元对应的M+1个第二输出向量;
    通过所述目标预测网络,对所述M+1个第二输出向量以及所述第五嵌入向量进行处理,以得到所述第二待预测数据单元。
  9. 根据权利要求7或8所述的方法,其特征在于,所述每个已知数据单元对应的第二输出向量为根据所述M个第一嵌入向量生成的;所述第一待预测数据单元对应的第二输出向量为根据所述M个第一嵌入向量以及所述第四嵌入向量生成的。
  10. 根据权利要求1至9任一所述的方法,其特征在于,所述目标数据为文本数据,所述已知数据单元为所述文本数据中的已知词,所述第一待预测数据单元为所述文本数据中的待预测词;或者,
    所述目标数据为语音数据,所述已知数据单元为所述语音数据中的已知音频序列,所述第一待预测数据单元为所述语音数据中的待预测音频序列;或者,
    所述目标数据为图像数据,所述已知数据单元为所述图像数据中的已知像素点,所述第一待预测数据单元为所述图像数据中的待预测像素点。
  11. 一种数据处理方法,其特征在于,所述方法包括:
    获取M个第一嵌入向量、以及第二嵌入向量;其中,每个第一嵌入向量用于表示目标数据中的一个数据单元以及所述一个数据单元在所述目标数据中的第一位置,所述第二嵌入向量用于指示目标处理任务;所述M为正整数;
    通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个数据单元对应的M个输出向量,其中,每个所述数据单元对应的输出向量为根据所述M个第一嵌入向量生成的;
    通过任务网络,对所述M个输出向量以及所述第二嵌入向量进行所述目标处理任务对应的处理,以得到任务处理结果。
  12. 根据权利要求11所述的方法,其特征在于,所述第一位置用于指示所述数据单元与其他数据单元之间的相对位置关系。
  13. 根据权利要求11或12所述的方法,其特征在于,所述目标编码器为第一转换transformer层,所述任务网络为第二transformer层。
  14. 根据权利要求13所述的方法,其特征在于,所述第一transformer层包括串行的多个子transformer层;所述通过目标编码器,对所述M个第一嵌入向量进行处理,以得到M个已知数据单元对应的M个第一输出向量,包括:
    通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到M个中间向量,并将所述M个中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为所述M个第一嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为所述M个输出向量。
  15. 根据权利要求11至14任一所述的方法,其特征在于,所述目标编码器包括注意力头,所述通过目标编码器,对所述M个第一嵌入向量进行处理,包括:
    获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述M个第一嵌入向量时,所述M个第一嵌入向量中任意两个第一嵌入向量之间存在注意力关联;
    根据所述注意力信息,通过所述目标编码器,对所述M个第一嵌入向量进行处理。
  16. 根据权利要求11至15任一所述的方法,其特征在于,所述目标数据为文本数据,所述数据单元为所述文本数据中的词;或者,
    所述目标数据为语音数据,所述已知数据单元为所述语音数据中的音频序列;或者,
    所述目标数据为图像数据,所述已知数据单元为所述图像数据中的像素点。
  17. 根据权利要求11至16任一所述的方法,其特征在于,所述目标处理任务包括短文本分类、长文本分类、自然语言推断、文本相似度匹配或文本情感分类。
  18. A data processing method, wherein the method comprises:
    obtaining a first encoder, a first prediction network, M first embedding vectors, and a second embedding vector; wherein each first embedding vector represents one known data unit in target data and a first position of the known data unit in the target data, the second embedding vector represents a second position, in the target data, of a first to-be-predicted data unit in the target data, and M is a positive integer;
    processing, by the first encoder, the M first embedding vectors to obtain M first output vectors corresponding to M known data units, wherein the first output vector corresponding to each known data unit is generated according to the M first embedding vectors;
    processing, by the first prediction network, the M first output vectors and the second embedding vector to obtain a third predicted data unit; and
    updating the first encoder and the first prediction network based on a difference between the third predicted data unit and the first to-be-predicted data unit, to obtain a target encoder and a target prediction network.
  19. The method according to claim 18, wherein the first position indicates relative positional relationships between the known data unit and other known data units and between the known data unit and the first to-be-predicted data unit; and the second position indicates relative positional relationships between the first to-be-predicted data unit and each known data unit in the target data.
  20. The method according to claim 18 or 19, wherein the first encoder is a first transformer layer, and the first prediction network is a second transformer layer.
  21. The method according to claim 20, wherein the first transformer layer comprises a plurality of serially connected sub-transformer layers, and the processing, by the first encoder, the M first embedding vectors to obtain M first output vectors corresponding to M known data units comprises:
    processing, by each of the sub-transformer layers, data output by a preceding sub-transformer layer adjacent to the sub-transformer layer, to obtain M intermediate vectors, and outputting the M intermediate vectors to a following sub-transformer layer adjacent to the sub-transformer layer; wherein if the sub-transformer layer is, among the plurality of sub-transformer layers, the transformer layer closest to an input side, input data of the sub-transformer layer is the M first embedding vectors; and if the sub-transformer layer is, among the plurality of sub-transformer layers, the transformer layer closest to an output side, data output by the sub-transformer layer is the M first output vectors.
  22. The method according to any one of claims 18 to 21, wherein the target data further comprises a second to-be-predicted data unit, and an order in which the second to-be-predicted data unit and the first to-be-predicted data unit are predicted is randomly determined.
  23. The method according to claim 22, wherein if the second to-be-predicted data unit is predicted after the first to-be-predicted data unit, the method further comprises:
    obtaining a fourth embedding vector and a fifth embedding vector, wherein the fourth embedding vector represents the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data, and the fifth embedding vector represents a third position, in the target data, of the second to-be-predicted data unit;
    processing, by the first encoder, the M first embedding vectors and the fourth embedding vector to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit; and
    processing, by the first prediction network, the M+1 second output vectors and the fifth embedding vector to obtain a fourth predicted data unit;
    wherein the updating the first encoder and the first prediction network based on the difference between the third predicted data unit and the first to-be-predicted data unit, to obtain the target encoder and the target prediction network, comprises:
    updating the first encoder and the first prediction network based on the difference between the third predicted data unit and the first to-be-predicted data unit and a difference between the fourth predicted data unit and the second to-be-predicted data unit, to obtain the target encoder and the target prediction network.
  24. The method according to claim 22 or 23, wherein the second output vector corresponding to each known data unit is generated according to the M first embedding vectors, and the second output vector corresponding to the first to-be-predicted data unit is generated according to the M first embedding vectors and the fourth embedding vector.
  25. The method according to any one of claims 18 to 24, wherein the target data is text data, the known data unit is a known word in the text data, and the first to-be-predicted data unit is a to-be-predicted word in the text data; or
    the target data is speech data, the known data unit is a known audio sequence in the speech data, and the first to-be-predicted data unit is a to-be-predicted audio sequence in the speech data; or
    the target data is image data, the known data unit is a known pixel in the image data, and the first to-be-predicted data unit is a to-be-predicted pixel in the image data.
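Claims 18 and 23 above describe a two-stage training flow: the M known-unit embeddings are encoded, the first to-be-predicted unit is predicted from its position embedding, the prediction is then fed back as an extra (fourth) embedding so the encoder runs on M+1 inputs when predicting the second unit, and both prediction differences drive the parameter update. The sketch below illustrates only this data flow under strong simplifying assumptions: scalar embeddings, a hypothetical one-parameter stand-in for both the encoder and the prediction network, a squared-error stand-in for the difference, and a fixed prediction order in place of claim 22's randomly determined order:

```python
def encoder(embeddings, w):
    # toy "first encoder": each output depends on all input embeddings
    total = sum(embeddings)
    return [w * total for _ in embeddings]

def prediction_network(output_vectors, position_embedding, w):
    # toy "first prediction network": combines the encoder outputs with
    # the position embedding of the unit to be predicted
    mean = sum(output_vectors) / len(output_vectors)
    return w * mean + position_embedding

def two_stage_loss(w, known, pos1, target1, pos2, target2):
    # stage 1: predict the first masked unit from the M known embeddings
    out1 = encoder(known, w)
    pred1 = prediction_network(out1, pos1, w)
    # stage 2: the first prediction joins the inputs (M+1 embeddings),
    # then the second masked unit is predicted
    out2 = encoder(known + [pred1], w)
    pred2 = prediction_network(out2, pos2, w)
    # squared differences stand in for the two prediction differences
    return (pred1 - target1) ** 2 + (pred2 - target2) ** 2

def train(known, pos1, t1, pos2, t2, w=0.1, lr=0.01, steps=500):
    eps = 1e-6
    for _ in range(steps):
        # finite-difference gradient of the single scalar parameter
        g = (two_stage_loss(w + eps, known, pos1, t1, pos2, t2)
             - two_stage_loss(w - eps, known, pos1, t1, pos2, t2)) / (2 * eps)
        w -= lr * g
    return w

known = [1.0, 2.0]          # M = 2 known-unit embeddings
w0 = 0.1
w_trained = train(known, pos1=0.1, t1=0.5, pos2=0.2, t2=0.7, w=w0)
```

The update step is the point of claim 23: both differences (stage 1 and stage 2) contribute to the same parameter update, rather than each prediction being trained in isolation.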
  26. A data processing apparatus, wherein the apparatus comprises a memory and a processor; the memory stores code, and the processor is configured to obtain the code and perform the method according to any one of claims 1 to 25.
  27. A computer storage medium, wherein the computer storage medium stores one or more instructions that, when executed by one or more computers, cause the one or more computers to implement the method according to any one of claims 1 to 25.
  28. A computer program product, wherein the computer program product comprises code that, when executed, implements the steps of the method according to any one of claims 1 to 25.
PCT/CN2022/087028 2021-04-18 2022-04-15 Data processing method and related device WO2022222854A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22790962.9A EP4318322A1 (en) 2021-04-18 2022-04-15 Data processing method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110415349.1 2021-04-18
CN202110415349.1A CN115292439A (zh) 2021-04-18 Data processing method and related device

Publications (1)

Publication Number Publication Date
WO2022222854A1 true WO2022222854A1 (zh) 2022-10-27

Family

ID=83721959

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087028 WO2022222854A1 (zh) 2021-04-18 2022-04-15 一种数据处理方法及相关设备

Country Status (4)

Country Link
US (1) US20240046067A1 (zh)
EP (1) EP4318322A1 (zh)
CN (1) CN115292439A (zh)
WO (1) WO2022222854A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116967640B (zh) * 2023-09-22 2024-01-05 杭州众能光电科技有限公司 Perovskite cell layer-following dust removal control apparatus and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007028658A (ja) * 2006-09-12 2007-02-01 Sony Corp Data processing device, data processing method, and program
CN108446374A (zh) * 2018-03-16 2018-08-24 北京三快在线科技有限公司 User intention prediction method and apparatus, electronic device, and storage medium
CN110489555A (zh) * 2019-08-21 2019-11-22 创新工场(广州)人工智能研究有限公司 Language model pre-training method incorporating word-class information
CN111160050A (zh) * 2019-12-20 2020-05-15 沈阳雅译网络技术有限公司 Document-level neural machine translation method based on a context memory network
CN112527938A (zh) * 2020-12-17 2021-03-19 安徽迪科数金科技有限公司 Chinese POI matching method based on natural language understanding

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975654A (zh) * 2023-08-22 2023-10-31 腾讯科技(深圳)有限公司 Object interaction method and apparatus, electronic device, storage medium, and program product
CN116975654B (zh) * 2023-08-22 2024-01-05 腾讯科技(深圳)有限公司 Object interaction method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
US20240046067A1 (en) 2024-02-08
CN115292439A (zh) 2022-11-04
EP4318322A1 (en) 2024-02-07

Similar Documents

Publication Publication Date Title
WO2021159714A1 (zh) Data processing method and related device
Kaymak et al. A brief survey and an application of semantic image segmentation for autonomous driving
WO2021164772A1 (zh) Method for training a cross-modal retrieval model, cross-modal retrieval method, and related apparatus
WO2022057776A1 (zh) Model compression method and apparatus
WO2022068627A1 (zh) Data processing method and related device
CN111898696A (zh) Method, apparatus, medium, and device for generating pseudo-labels and a label prediction model
WO2023160472A1 (zh) Model training method and related device
WO2022222854A1 (zh) Data processing method and related device
WO2022253074A1 (zh) Data processing method and related device
US11983903B2 (en) Processing images using self-attention based neural networks
WO2022001232A1 (zh) Question-answer data augmentation method and apparatus, computer device, and storage medium
WO2023236977A1 (zh) Data processing method and related device
WO2024041479A1 (zh) Data processing method and apparatus
CN113011568B (zh) Model training method, data processing method, and device
WO2023020613A1 (zh) Model distillation method and related device
US20240152770A1 (en) Neural network search method and related device
WO2023231954A1 (zh) Data denoising method and related device
WO2022111387A1 (zh) Data processing method and related apparatus
CN116432019A (zh) Data processing method and related device
CN112115744B (zh) Point cloud data processing method and apparatus, computer storage medium, and electronic device
CN116541492A (zh) Data processing method and related device
Srinivas et al. A comprehensive survey of techniques, applications, and challenges in deep learning: A revolution in machine learning
Bayoudh A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges
CN111445545B (zh) Method and apparatus for converting text into a sticker image, storage medium, and electronic device
WO2023231753A1 (zh) Neural network training method, data processing method, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22790962

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022790962

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022790962

Country of ref document: EP

Effective date: 20231024

NENP Non-entry into the national phase

Ref country code: DE