CN113887730A - Quantum simulator implementation method and device, related equipment and quantum simulation method - Google Patents


Info

Publication number
CN113887730A
CN113887730A
Authority
CN
China
Prior art keywords
quantum
data
processor
simulator
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111040089.0A
Other languages
Chinese (zh)
Inventor
杨凯
范登栋
张超
刘勇翔
徐鹏翔
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202111040089.0A priority Critical patent/CN113887730A/en
Publication of CN113887730A publication Critical patent/CN113887730A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 10/00: Quantum computing, i.e. information processing based on quantum-mechanical phenomena


Abstract

The invention discloses a quantum simulator implementation method and device, related equipment, and a quantum simulation method. The quantum simulator implementation method comprises the following steps: constructing a qubit gate operator, wherein the qubit gate operator is used for transferring data within a target device based on the operation bit of a quantum gate and updating the state vector; constructing a quantum simulator, wherein the quantum simulator comprises the qubit gate operator and is a software package for realizing quantum circuit simulation; and deploying the quantum simulator onto the target device, wherein the target device is an Ascend AI processor. Compared with the prior art, the scheme of the invention obtains a qubit gate operator that can transfer data and update the state vector within the Ascend AI processor, and from it a quantum simulator that can be deployed and run on the Ascend AI processor, which is beneficial to fully utilizing the computing capability of the Ascend AI processor to realize quantum simulation.

Description

Quantum simulator implementation method and device, related equipment and quantum simulation method
Technical Field
The invention relates to the technical field of quantum computation simulation, in particular to a quantum simulator implementation method, a device, related equipment and a quantum simulation method.
Background
Quantum computing theory has shown great potential in solving important problems beyond the capabilities of classical computers, for example in cryptography, financial modeling, and machine learning. Some quantum computers have been proposed in recent years, but at present quantum computers are not yet in wide use.
In the prior art, quantum simulation is usually realized through quantum circuit simulation. Quantum circuit simulation is essentially a software simulation of the computational process of a real quantum computer, and the software package implementing it is generally called a quantum simulator. The problem in the prior art is that existing quantum simulators can only run on a CPU or a GPU; at present there is no quantum simulator that can run on an Ascend AI processor. On a training server equipped with Ascend AI processors, users can only run a traditional quantum simulator on CPU resources, which is not conducive to fully utilizing the computing power of the Ascend processor.
Thus, there is still a need for improvement and development of the prior art.
Disclosure of Invention
The present invention provides a quantum simulator implementation method and device, related equipment, and a quantum simulation method, aiming to solve the problem that quantum simulators in the prior art cannot run on an Ascend AI processor, which is not conducive to fully utilizing the computing capability of the Ascend AI processor.
In order to achieve the above object, a first aspect of the present invention provides a quantum simulator implementation method, where the method includes:
constructing a qubit gate operator, wherein the qubit gate operator is used for transferring data within a target device based on the operation bit of a quantum gate and updating the state vector;
constructing a quantum simulator, wherein the quantum simulator comprises the qubit gate operator, and the quantum simulator is a software package for realizing quantum circuit simulation;
deploying the quantum simulator onto the target device, wherein the target device is an Ascend AI processor.
Optionally, the qubit gate operator is configured to perform single-qubit gate operation on the qubit.
Optionally, the qubit gate operator performs quantum simulation based on the following steps:
obtaining the type of the data to be calculated and the core count and buffer size of the Ascend AI processor;
enabling multi-core processing based on the core count and partitioning the state vector corresponding to the data to be calculated;
transferring the partitioned state vector from global memory to the output buffer, then performing the matrix-vector multiplication corresponding to the qubit gate operation to obtain the calculated data;
and transferring the calculated data from the output buffer back to global memory.
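The four steps above can be sketched in NumPy as follows. This is only an illustrative sketch: the function name, the `buffer_len` parameter, and the chunked copies standing in for the global-memory-to-Unified-Buffer transfers are all assumptions, not the patent's actual Ascend operator code.

```python
import numpy as np

def apply_single_qubit_gate(state, gate, m, buffer_len=1 << 16):
    """Apply a 2x2 gate on operation bit m of a state vector, mimicking the
    steps above: partition the work, copy chunks from 'global memory' into a
    bounded 'output buffer', do the matrix-vector multiply, and copy back."""
    out = state.copy()                        # stands in for global memory
    idx = np.arange(state.shape[0])
    i_idx = idx[(idx >> m) & 1 == 0]          # indices whose bit m is 0
    j_idx = i_idx + (1 << m)                  # partners differ by 2^m
    for start in range(0, i_idx.size, buffer_len):   # chunk = one buffer load
        ii = i_idx[start:start + buffer_len]
        jj = j_idx[start:start + buffer_len]
        a, b = state[ii], state[jj]           # "carry" the chunk into the buffer
        out[ii] = gate[0, 0] * a + gate[0, 1] * b    # 2x2 matrix-vector
        out[jj] = gate[1, 0] * a + gate[1, 1] * b    # multiply per pair
    return out                                # "carry" results back out
```

For instance, applying the NOT gate on bit 0 of the 2-qubit state (1, 0, 0, 0) yields (0, 1, 0, 0).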
Optionally, the qubit gate operator includes a distributed single-qubit gate operator, a low-order single-qubit gate operator, and a high-order single-qubit gate operator, which correspond to different operation bits, respectively.
Optionally, the constructing the qubit gate operator includes:
acquiring a distributed high operation bit range, a low operation bit range, and a non-distributed high operation bit range based on the quantum circuit and the target device, wherein the operation bits in the distributed high operation bit range are operation bits that require distributed communication; the address distance corresponding to the operation bits in the low operation bit range is smaller than the minimum data transfer scale of the Ascend AI processor; and the address distance corresponding to the operation bits in the non-distributed high operation bit range is not smaller than the minimum data transfer scale of the Ascend AI processor and involves no distributed communication;
and respectively constructing a distributed single-quantum-bit gate operator corresponding to the distributed high-operation bit range, a low-order single-quantum-bit gate operator corresponding to the low-operation bit range and a high-order single-quantum-bit gate operator corresponding to the non-distributed high-operation bit range.
Optionally, the distributed single-qubit gate operator performs data transfer based on the following steps:
acquiring the data to be calculated, wherein the data to be calculated comprises a local state vector and a paired state vector, the paired state vector being a state vector, obtained through distributed communication, that is paired with the local state vector, and the local state vector and the paired state vector residing on different Ascend AI processors;
and looping over the local state vector and the paired state vector based on the length of the local state vector and a preset segment length, where each iteration transfers data segments of the same position and the same length in the two vectors from global memory to the output buffer for calculation, and after the calculation is finished, transfers the calculated data from the output buffer back to global memory.
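A minimal NumPy sketch of this segmented loop follows. It assumes the local rank holds the "0" element of each pair and therefore applies only the top row of the unitary; the function name and `seg_len` default are illustrative, and real distributed communication is replaced by a plain array argument.

```python
import numpy as np

def distributed_gate(local, paired, gate, seg_len=4):
    """Sketch of the distributed single-qubit-gate operator: `paired` is the
    remote half obtained via distributed communication; both halves are walked
    segment by segment (same offset, same length), as if each segment were
    carried from global memory into the output buffer before the 2x2 update."""
    out = np.empty_like(local)
    for start in range(0, local.size, seg_len):
        a = local[start:start + seg_len]    # segment of the local half
        b = paired[start:start + seg_len]   # matching segment of the remote half
        # assumption of this sketch: this rank holds the "0" branch of each
        # pair, so only the top row of the unitary is applied here
        out[start:start + seg_len] = gate[0, 0] * a + gate[0, 1] * b
    return out
```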
Optionally, the high-order single-qubit gate operator performs data transport based on the following steps:
acquiring the data to be calculated, wherein the data to be calculated is a local state vector, and the local state vector contains both state vectors of each pair, located on the same Ascend AI processor at a distance not smaller than the minimum data transfer scale of the Ascend AI processor;
acquiring the length of the continuously transferred data segment and the interval between adjacent continuous data segments;
and loading data from different positions of the local state vector in global memory into the output buffer for calculation, based on the length of the continuously transferred data segment and the interval between adjacent continuous data segments, and after the calculation is finished, transferring the calculated data from the output buffer back to global memory.
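This strided loading pattern can be sketched as below. As an assumption of this sketch (not code from the patent), the contiguous segment length is taken as 2^m and the interval between segment starts as 2^(m+1), which matches the pairing rule of formula (2).

```python
import numpy as np

def high_order_gate(local, gate, m):
    """Sketch of the high-order single-qubit-gate operator: paired amplitudes
    live on the same device, 2^m apart, so data is loaded as contiguous
    segments of length 2^m whose starts are 2^(m+1) apart."""
    seg = 1 << m             # length of one contiguous data segment
    step = seg << 1          # interval between adjacent segment starts
    out = local.copy()
    for start in range(0, local.size, step):
        a = local[start:start + seg]          # "0" halves of the pairs
        b = local[start + seg:start + step]   # matching "1" halves
        out[start:start + seg] = gate[0, 0] * a + gate[0, 1] * b
        out[start + seg:start + step] = gate[1, 0] * a + gate[1, 1] * b
    return out
```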
Optionally, the low-order single-qubit gate operator performs data transport based on the following steps:
acquiring the data to be calculated, wherein the data to be calculated is a local state vector, and the local state vector contains both state vectors of each pair, located on the same Ascend AI processor at a distance smaller than the minimum data transfer scale of the Ascend AI processor;
loading data into the output buffer starting from element 0 of the local state vector in global memory to obtain a zeroth array;
loading data into the output buffer starting from element 1 of the local state vector in global memory to obtain a first array;
loading data into the output buffer starting from element -1 of the local state vector in global memory to obtain a second array;
and after calculating a third array from the upper half of the unitary matrix using the zeroth array and the first array, and calculating a fourth array from the lower half of the unitary matrix using the second array and the zeroth array, selecting between the third and fourth arrays bit by bit to obtain a fifth array, and transferring the data of the fifth array from the output buffer back to global memory.
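A hypothetical NumPy rendering of these five arrays follows. The +1/-1 offsets in the text correspond to operation bit 0; this sketch generalizes them to a stride of 2^m and uses `np.roll` as a stand-in for the shifted buffer loads, so the names and the generalization are assumptions, not the patent's kernel.

```python
import numpy as np

def low_order_gate(local, gate, m):
    """Sketch of the low-order single-qubit-gate operator: the pair stride
    2^m is below the minimum transfer scale, so three shifted copies of the
    state are loaded, both halves of the unitary are evaluated over the whole
    vector, and a per-element bit select keeps the valid result."""
    stride = 1 << m
    arr0 = local                                   # loaded from element 0
    arr1 = np.roll(local, -stride)                 # loaded from element +2^m
    arr2 = np.roll(local, stride)                  # loaded from element -2^m
    arr3 = gate[0, 0] * arr0 + gate[0, 1] * arr1   # upper half of the unitary
    arr4 = gate[1, 0] * arr2 + gate[1, 1] * arr0   # lower half of the unitary
    bit = (np.arange(local.size) >> m) & 1         # bit m of each index
    arr5 = np.where(bit == 0, arr3, arr4)          # bit-by-bit select
    return arr5
```

The wrap-around values introduced by `np.roll` only appear at positions that the bit select masks out, so the result is exact whenever the vector length is a multiple of 2^(m+1).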
Optionally, the constructing of the quantum simulator, where the quantum simulator includes the qubit gate operator and is a software package for implementing quantum circuit simulation, includes:
packaging the qubit gate operator to generate, as the quantum simulator, a software package that can be deployed and compiled for execution on the Ascend AI processor.
A second aspect of the present invention provides a quantum simulator implementation apparatus, wherein the apparatus comprises:
the operator construction module is used for constructing a qubit gate operator, wherein the qubit gate operator is used for transferring data within a target device and updating the state vector based on the operation bit of a quantum gate;
the quantum simulator construction module is used for constructing a quantum simulator, wherein the quantum simulator comprises the qubit gate operator and is a software package for realizing quantum circuit simulation;
and a deployment module for deploying the quantum simulator onto the target device, wherein the target device is an Ascend AI processor.
A third aspect of the present invention provides a quantum simulation method based on the quantum simulator described above, wherein the method includes:
acquiring the data to be calculated;
and calling the qubit gate operator to transfer the data to be calculated and update the state vector.
A fourth aspect of the present invention provides an intelligent terminal, where the intelligent terminal includes a memory, a processor, and a quantum simulator implementation program stored in the memory and executable on the processor, and the quantum simulator implementation program implements any one of the steps of the quantum simulator implementation method when executed by the processor.
A fifth aspect of the present invention provides a computer-readable storage medium, where a quantum simulator implementation program is stored on the computer-readable storage medium, and when executed by a processor, the quantum simulator implementation program implements any one of the steps of the quantum simulator implementation method.
As can be seen from the above, in the scheme of the present invention, a qubit gate operator is constructed, where the qubit gate operator is used to transfer data within a target device based on the operation bit of a quantum gate and to update the state vector; a quantum simulator is constructed, where the quantum simulator comprises the qubit gate operator and is a software package for realizing quantum circuit simulation; and the quantum simulator is deployed onto the target device, where the target device is an Ascend AI processor. Compared with the prior art, the scheme of the invention obtains a qubit gate operator that can transfer data and update the state vector within the Ascend AI processor, and from it a quantum simulator that can be deployed and run on the Ascend AI processor, which is beneficial to fully utilizing the computing capability of the Ascend AI processor to realize quantum simulation.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic diagram of the Da Vinci AI Core architecture according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a quantum simulator implementation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a calculation flow of a qubit gate operator according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating step S100 in FIG. 2 according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a calculation flow of an operator according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of data pairing in the CusOneGateMulStatesDist operator according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of data pairing in a CusOneGateMulStateHigh operator according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of data pairing in the CusOneGateMulStatesLow operator according to an embodiment of the present invention;
fig. 9 is a schematic flowchart of a method for reusing data blocks in an output buffer according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a quantum simulator implementation apparatus according to an embodiment of the present invention;
FIG. 11 is a flow chart of a quantum simulation method according to an embodiment of the present invention;
fig. 12 is a schematic block diagram of an internal structure of an intelligent terminal according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when …" or "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted depending on the context to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Quantum computing theory has shown great potential in solving important problems beyond the capabilities of classical computers, for example in cryptography, financial modeling, and machine learning. Some quantum computers have been introduced in recent years, such as Google's 54-qubit Sycamore and 72-qubit Bristlecone, and the 76-qubit Jiuzhang computer of the University of Science and Technology of China. However, quantum computers remain a scarce resource and cannot be used widely and conveniently. Current quantum computers also have very high error rates and cannot be used to verify complex quantum algorithms. Furthermore, quantum states collapse after measurement, so not all intermediate data can be collected on a true quantum system. Therefore, quantum circuit simulation is essential to advancing quantum computing theory. Quantum circuit simulation is essentially a software simulation of the computational process of a real quantum computer, and the software package implementing it is generally called a quantum simulator.
The Ascend AI processor is an AI processor designed to meet the computing-power demands that today's rapidly developing neural networks place on chips, and can provide powerful and efficient multiply-add computing power for integer or floating-point numbers. Unlike conventional CPUs and GPUs, the Ascend AI processor adopts the Da Vinci architecture. The Da Vinci architecture is a self-developed, completely new AI-oriented computing architecture. Unlike traditional CPUs and GPUs that support general-purpose computing, and also unlike application-specific chips (ASICs) dedicated to a particular algorithm, the Da Vinci architecture is essentially designed to accommodate common applications and algorithms in a particular domain, and is often referred to as a "Domain Specific Architecture" (DSA). The computational core of the Ascend AI processor is mainly composed of AI Cores using the Da Vinci architecture; FIG. 1 is a schematic diagram of the Da Vinci AI Core architecture according to an embodiment of the present invention. To improve the completeness of AI computation and the computational efficiency in different scenarios, the Da Vinci architecture integrates a matrix computation unit (Cube Unit), a vector computation unit (Vector Unit), and a scalar computation unit (Scalar Unit). It also supports computation at multiple precisions, covering the data-precision requirements of both training and inference scenarios and thereby the full range of use cases. The AI Core comprises computing units, storage units, and control units. The computing units mainly comprise the matrix computation unit (Cube), the vector computation unit (Vector), and the scalar computation unit (Scalar), which complete the different types of data computation in the AI Core.
Internal storage exists in the AI Core, and the AI Core needs to load data from external storage into this internal storage to complete the corresponding calculation. The internal storage of the AI Core includes the L1 Buffer, L0 Buffers, Unified Buffer, General-Purpose Registers (GPR), Special-Purpose Registers (SPR), Scalar Buffer, and so on. To coordinate data transmission and movement within the AI Core, the AI Core further includes a Bus Interface Unit (BIU), a Memory Transfer Engine (MTE), and so on, where the BIU is the interface between the AI Core and the bus, and the MTE is the data transfer unit that moves data between different buffers. The control units in the AI Core mainly include the System Control module, the instruction dispatch module (Instr. Dispatch), the matrix operation queue (Cube Queue), the vector operation queue (Vector Queue), and the memory transfer queue (MTE Queue). After the instruction dispatch module issues instructions, they are sent, according to instruction type, to the matrix operation queue, the vector operation queue, or the memory transfer queue, respectively.
CANN (Compute Architecture for Neural Networks) is a heterogeneous computing architecture proposed by Huawei for AI scenarios, providing chip compute libraries and a highly automated operator development tool. The core of CANN is that highly automated operator development tool, Tensor Engine. Through a unified DSL interface, together with tool sets for high-level template packaging and automatic performance tuning, users can conveniently develop custom operators on Ascend chips. Meanwhile, CANN already supports all major AI frameworks. AscendCL (Ascend Computing Language) provides C-language API libraries for device management, context management, stream management, memory management, model loading and execution, operator loading and execution, media data processing, and so on, with which users can develop deep neural network applications for functions such as object recognition and image classification. Users can call AscendCL interfaces through a third-party framework to use the computing power of the Ascend AI processor; users can also use the AscendCL package to implement a third-party library providing the runtime-management and resource-management capabilities of the Ascend AI processor.
In the prior art, quantum simulation is usually realized through quantum circuit simulation. Quantum circuit simulation is essentially a software simulation of the computational process of a real quantum computer, and the software package implementing it is generally called a quantum simulator. The problem in the prior art is that existing quantum simulators can only run on a CPU or a GPU; at present there is no quantum simulator that can run on an Ascend AI processor. On a training server equipped with Ascend AI processors, users can only run a traditional quantum simulator on CPU resources, which is not conducive to fully utilizing the computing power of the Ascend processor or to reducing simulation time.
The most common way to build a quantum circuit simulation system is to represent the quantum state as a vector and to treat the quantum gates acting on the quantum circuit as small matrix-vector multiplications. A system of n qubits has $2^n$ possible states, and the quantum simulator must store the complex number (also called the amplitude) of the ground state contained in each possible state. If two single-precision floating-point numbers are used to represent each complex number, simulating a 30-qubit quantum system consumes 8 GB of memory. In an n-qubit system, applying a single-qubit gate to the m-th qubit requires operating on the $2^n$ states; all the states are paired two by two, and the small matrix-vector multiplications have no data dependencies, so they can be parallelized, as shown in the following formula (1):

$$\begin{pmatrix} \alpha_i' \\ \alpha_j' \end{pmatrix} = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix} \begin{pmatrix} \alpha_i \\ \alpha_j \end{pmatrix} \tag{1}$$

Here the qubit gate is expressed as a unitary matrix of scale $2 \times 2$, each element of which is a complex number, namely $A_{00}$, $A_{01}$, $A_{10}$, and $A_{11}$. The $2^n$ possible states of the n-qubit system, called the state vector, are stored in an array of length $2^n$ whose elements are

$$\alpha_0, \alpha_1, \ldots, \alpha_{2^n - 1},$$

where $\alpha_i$ and $\alpha_j$ denote the i-th and j-th elements of the array. In formula (1), the matrix is the unitary qubit gate matrix, the vector on the right-hand side holds elements of the state vector before the qubit gate is applied, and the primed vector on the left-hand side holds the updated elements after the qubit gate is applied. i and j satisfy the following formula (2):

$$0 \le i < 2^n, \qquad \left\lfloor i / 2^m \right\rfloor \bmod 2 = 0, \qquad j = i + 2^m \tag{2}$$

Formula (2) gives the conditions that i and j must satisfy in the calculation. m denotes that the gate currently operates on qubit number m, and the subscripts of the two paired state elements differ by $2^m$. The first condition keeps i within the bounds of the array, and the second requires that shifting the binary representation of i right by m bits yields a number whose remainder on division by 2 is 0, i.e., bit m of i is 0. For example, in a 4-qubit system with m = 2: when i = 0 (binary 0000), j = 4 (binary 0100); when i = 1 (binary 0001), j = 5 (binary 0101).
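The pairing rule of formulas (1) and (2) can be checked with a few lines of Python (an illustrative helper, not part of the patent):

```python
def paired_indices(n, m):
    """Return the (i, j) index pairs touched by a single-qubit gate on bit m
    of an n-qubit system, per formula (2): j = i + 2^m for every i in
    [0, 2^n) whose bit m is 0."""
    pairs = []
    for i in range(1 << n):
        if (i >> m) & 1 == 0:       # bit m of i must be 0
            pairs.append((i, i + (1 << m)))
    return pairs
```

For the 4-qubit, m = 2 example in the text, the returned pairs include (0, 4) and (1, 5).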
To solve the problems of the prior art, the present invention provides a quantum simulator implementation method. The embodiments of the present invention provide a systematic, feasible set of quantum simulator implementation and optimization schemes based on the Ascend AI processor, targeting the features and programming characteristics of its Da Vinci architecture and combining them with the characteristics of quantum simulation programs, so as to fully utilize the computing resources of the AI Core. They also provide a general implementation and optimization method for developers who develop, port, or optimize quantum simulator programs for the Ascend AI processor, and have guiding significance for the efficient deployment of quantum simulator programs on other high-performance AI computing platforms.
As shown in fig. 2, an embodiment of the present invention provides a quantum simulator implementation method, and specifically, the method includes the following steps:
Step S100, constructing a qubit gate operator, wherein the qubit gate operator is used for transferring data within a target device and updating the state vector based on the operation bit of a quantum gate.
In this embodiment, the qubit gate operator is a function that can be compiled and run on a target device. The qubit gate operator is used for performing single-qubit gate operations on qubits.
Step S200, constructing a quantum simulator, wherein the quantum simulator includes the qubit gate operator, and the quantum simulator is a software package for implementing quantum circuit simulation.
Specifically, the qubit gate operator is packaged to generate a software package that can be deployed, compiled, and executed on the target device, which serves as the quantum simulator. The quantum simulator includes the qubit gate operator together with other data, such as the functions and instructions that support the normal use of the qubit gate operator on the target device; the specific data may be set and adjusted according to actual requirements and is not specifically limited herein.
Step S300, deploying the quantum simulator onto the target device, which is an Ascend AI processor.
Specifically, the quantum simulator is compiled and run on the target device, so that the target device can call the qubit gate operator to perform data transfers and state vector updates, thereby realizing the simulation of the quantum circuit. In this embodiment, the target device is an Ascend AI processor. In practical use, the target device may also be another processor or device, which is not specifically limited herein.
As can be seen from the above, in the quantum simulator implementation method provided by the embodiment of the present invention, a qubit gate operator is constructed, where the qubit gate operator is used to transfer data within a target device based on the operation bit of a quantum gate and to update the state vector; a quantum simulator is constructed, where the quantum simulator comprises the qubit gate operator and is a software package for realizing quantum circuit simulation; and the quantum simulator is deployed onto the target device, where the target device is an Ascend AI processor. Compared with the prior art, the scheme of the invention obtains a qubit gate operator that can transfer data and update the state vector within the Ascend AI processor, and from it a quantum simulator that can be deployed and run on the Ascend AI processor, which is beneficial to fully utilizing the computing capability of the Ascend AI processor to realize quantum simulation.
Specifically, in this embodiment, as shown in fig. 3, the qubit gate operator performs quantum simulation based on the following steps:
Step A100, obtain the type of the data to be calculated and the core number and buffer size of the Ascend AI processor.
Step A200, start multi-core processing based on the core number and divide the state vector corresponding to the data to be calculated.
Step A300, carry the divided state vectors from the global memory to the output buffer, then perform the matrix-vector multiplication corresponding to the qubit gate operation to obtain the calculated data.
Step A400, carry the calculated data from the output buffer back to the global memory.
The data to be calculated is the data on which quantum simulation is to be performed, and its type may be half-precision floating point, single-precision floating point, and so on. On the Ascend AI processor, the output buffer (Unified Buffer) inside the AI Core is separate from the global memory (Global Memory) outside the AI Core, so computing data in the AI Core requires that the data first be carried from the Global Memory to the Unified Buffer, then calculated in the AI Core, and finally carried back to the Global Memory.
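The flow of steps A100 to A400 can be sketched in plain Python (a simplified model only: ordinary lists stand in for the Global Memory and the Unified Buffer, and the function and parameter names are hypothetical — the real implementation uses TIK instructions on the Ascend AI Core):

```python
def apply_gate_segmented(global_mem, gate, pair_offset, segment_len):
    """Apply a 2x2 single-qubit gate to a state vector held in 'global_mem',
    carrying one segment at a time into a simulated Unified Buffer."""
    n = len(global_mem)
    for base in range(0, n, 2 * pair_offset):          # blocks of paired halves
        for s in range(base, base + pair_offset, segment_len):
            # "data_move" Global Memory -> Unified Buffer for both halves
            ub0 = global_mem[s:s + segment_len]
            ub1 = global_mem[s + pair_offset:s + pair_offset + segment_len]
            # vector compute inside the AI Core: gate matrix on each state pair
            out0 = [gate[0][0] * a + gate[0][1] * b for a, b in zip(ub0, ub1)]
            out1 = [gate[1][0] * a + gate[1][1] * b for a, b in zip(ub0, ub1)]
            # "data_move" Unified Buffer -> Global Memory, original addresses
            global_mem[s:s + segment_len] = out0
            global_mem[s + pair_offset:s + pair_offset + segment_len] = out1
```

For example, applying the Pauli-X matrix with pair_offset 4 to an 8-element state vector (operation bit 2 of 3 qubits) swaps elements 0 to 3 with elements 4 to 7.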
Specifically, in this embodiment, the qubit gate operator includes a distributed single-qubit gate operator, a low-order single-qubit gate operator, and a high-order single-qubit gate operator, which correspond to different operation bits, respectively. In this embodiment, as shown in fig. 4, the step S100 specifically includes:
Step S101, a distributed high operation bit range, a low operation bit range, and a non-distributed high operation bit range are obtained based on the quantum circuit and the target device.
Here, the operation bits in the distributed high operation bit range are operation bits that require distributed communication; the address distance corresponding to the operation bits in the low operation bit range is smaller than the minimum data-carrying dimension of the Ascend AI processor; and the address distance corresponding to the operation bits in the non-distributed high operation bit range is not smaller than the minimum data-carrying dimension of the Ascend AI processor, with no distributed communication involved.
Step S102, a distributed single-qubit gate operator corresponding to the distributed high operation bit range, a low-order single-qubit gate operator corresponding to the low operation bit range, and a high-order single-qubit gate operator corresponding to the non-distributed high operation bit range are constructed respectively.
Depending on how high the operation bit of the quantum gate is, the qubit gate operator calls one of these three operators to perform the specific calculation.
Specifically, if distributed computation is performed and a high-order qubit requiring communication is operated on, two state-vector segments (i.e., the local state vector and the paired state vector) are obtained after communication, and the distributed single-qubit gate operator is called to complete one single-gate operation; this operator may be named CusOneGateMulStateDist. The CusOneGateMulStateDist operator has four inputs and one output. The first input tensor is the local state vector (input_x) of length 2 × 2^n, with the real and imaginary parts of the complex numbers stored separately (both reside in the Global Memory of the Ascend AI processor; separate storage means two arrays, the first containing the real parts of all local elements and the second containing the imaginary parts). The second input tensor is the state-vector segment (input_y) paired with the local state vector, obtained by communication. The third input tensor is the quantum gate matrix data (input_m). The fourth input tensor is a judger indicating whether, in formula (1), the local state vector plays the role of the 0th or the 1st element of the vector when multiplied by the gate matrix. The single output tensor is the calculation result (output). The data types and data sizes of input_x, input_y and output should be consistent. The communication here refers to communication between different Ascend AI processors; for example, the first Ascend AI processor obtains the data of the third Ascend AI processor through HCCL, the collective communication library. The quantum gate matrix data (input_m) is the quantum gate expressed as a matrix, depending on which quantum gate calculation is to be performed on the data; for example, the matrix data corresponding to the Pauli-X gate is
    [0 1]
    [1 0]
The vector referred to above is the vector on the left-hand side of equation (1). In one application scenario, the data of the first Ascend AI processor and the data of the third Ascend AI processor are paired in order. The data indices of the first Ascend AI processor are smaller than those of the third, i.e., j = i + 2^m in formula (2); during the calculation, the data of the first Ascend AI processor is placed at α_i and the data of the third Ascend AI processor at α_j. The judger mentioned above may be of boolean or integer type. The calculation result (output) is obtained by applying the calculation shown in formula (1) to the data: the state vector is updated by multiplying each state pair by the quantum gate matrix.
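The per-pair update for the distributed case can be illustrated as follows (a sketch only: the judger flag follows the description above, but the Python signature is hypothetical — on the device this is done with TIK vector instructions on the separated real/imaginary arrays):

```python
def dist_one_gate(input_x, input_y, m, local_is_first):
    """CusOneGateMulStateDist-style update: input_x holds the local
    amplitudes, input_y the paired amplitudes received over HCCL, m the
    2x2 gate matrix, and local_is_first is the 'judger' saying whether
    the local data plays the role of alpha_i (True) or alpha_j (False)."""
    out = []
    for a_local, a_pair in zip(input_x, input_y):
        if local_is_first:      # local amplitude is alpha_i in formula (1)
            out.append(m[0][0] * a_local + m[0][1] * a_pair)
        else:                   # local amplitude is alpha_j in formula (1)
            out.append(m[1][0] * a_pair + m[1][1] * a_local)
    return out
```

With the Pauli-X matrix, each processor's output is simply its partner's amplitudes, regardless of which role it plays.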
The specific bit numbers corresponding to the high bits (i.e., the distributed high operation bit range) can be determined or set according to actual requirements. For example, for a circuit with 20 qubits, the length of the state vector is 2^20. If 8 Ascend AI processors participate in the computation, each needs to store 2^17 complex numbers: the first Ascend AI processor stores elements 0, 1, 2, ..., 2^17−1; the second stores elements 2^17, 2^17+1, 2^17+2, ..., 2^18−1; and so on.
If qubit 18 (counting from 0) is to be operated on, i.e., m = 18 in equation (2), the relationship between i and j in equation (2) is j = i + 2^18. The i = 0 element must be paired with the j = 2^18 element; element 0 resides in the first Ascend AI processor while element 2^18 resides in the third, so the first and third Ascend AI processors need to exchange data (i.e., distributed communication) before computation. In this case qubit m is called an operation bit requiring communication, and the distributed single-qubit gate operator CusOneGateMulStateDist is called. That is, the condition for invoking the distributed single-qubit gate operator is that the qubit to be operated on is an operation bit requiring communication. The two state-vector segments obtained after communication are the input data that must be prepared before the operator is called. In one application scenario, the first Ascend AI processor, besides its own stored elements 0, 1, 2, ..., 2^17−1, also stores elements 2^18, 2^18+1, 2^18+2, ..., 2^18+2^17−1 obtained from the third Ascend AI processor. Since the real and imaginary parts of the complex numbers are stored separately, from inside the program these appear as four arrays of length 2^17: the first array holds the real parts of the complex numbers of elements 0 to 2^17−1, the second their imaginary parts, the third the real parts of elements 2^18 to 2^18+2^17−1, and the fourth their imaginary parts. Elements with the same subscript in each array are paired during computation.
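The block layout described above can be sketched as follows (hypothetical helper name; each processor holds one contiguous slice of the global state vector):

```python
def owner(global_index, local_len):
    """Map a global state-vector index to (processor rank, local offset)
    when each Ascend processor stores local_len consecutive amplitudes."""
    return global_index // local_len, global_index % local_len

# 20-qubit example from the text: 8 processors, 2**17 amplitudes each.
LOCAL_LEN = 1 << 17
# Operation bit m = 18 pairs element i with j = i + 2**18: element 0 sits on
# processor rank 0 (the first) and element 2**18 on rank 2 (the third), so
# those two processors must exchange data before computing.
```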
The minimum data-carrying dimension of the Ascend AI processor is 32 bytes, corresponding to 16 half-precision floating point numbers. If the operation bit is 0, 1, 2 or 3, the distance between any two elements that need to be paired is less than 32 bytes, and the low-order single-qubit gate operator is called. The other operation bits invoke the high-order single-qubit gate operator.
Regarding the above local state vector (input_x): "local" is a concept of distributed computation, meaning that if a piece of data is stored on X (e.g., the first Ascend AI processor), then this piece of data is local to X. In the above example, elements 0, 1, 2, ..., 2^17−1 are the local data of the first Ascend AI processor, but this data is not local to the third Ascend AI processor.
If the address distance of the states that need to be paired is greater than or equal to 32 bytes and no distributed communication is involved, the high-order single-qubit gate operator is called to complete one single-gate operation. This operator may be named CusOneGateMulStateHigh. The CusOneGateMulStateHigh operator has three inputs and one output: the first input tensor is the state vector (input_x), the second input tensor is the quantum gate matrix data (input_m), the third is the offset of the paired state, and the output tensor is the calculation result (output). For the m-th operation bit, the offset equals 2^m; see equation (2). The data type and data size of input_x and output should be consistent.
The state vector is an array, and the states are the elements of the state vector, which are numbers. For example, when the operation bit is 2, elements 0 and 4 pair, 1 and 5 pair, 2 and 6 pair, 3 and 7 pair, 8 and 12 pair, and so on; the numbers here are element subscripts. An "address" is a concept of a computer memory unit; it can be represented by a hexadecimal number and indicates the location of the data in memory.
The m-th operation bit is the m-th qubit in the quantum circuit. The binary subscripts of the 2^n states are, in order, 00…00, 00…01, 00…10, 00…11, …, 11…11; the state pairs are those with subscripts xx…x0x…xx and xx…x1x…xx, i.e., the states whose digit at operation bit m differs while all other digits are the same. The offset 2^m equals the difference between xx…x1x…xx and xx…x0x…xx.
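The pairing rule just described — states whose binary subscripts differ only in bit m, at offset 2^m — can be enumerated directly (a small illustrative sketch):

```python
def paired_indices(n_qubits, m):
    """List the state pairs (i, i + 2**m) for operation bit m: i ranges over
    all subscripts whose m-th binary digit is 0."""
    offset = 1 << m
    return [(i, i + offset) for i in range(1 << n_qubits) if not (i & offset)]
```

For operation bit 2 this reproduces the pairs 0 and 4, 1 and 5, 2 and 6, 3 and 7, 8 and 12, ... mentioned above.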
If the address distance of the states that need to be paired is less than 32 bytes, the low-order single-qubit gate operator (named CusOneGateMulStateLow) is called to complete one single-gate operation. The CusOneGateMulStateLow operator has three inputs and one output: the first input tensor is the state vector (input_x), the second input tensor is the quantum gate matrix data (input_m), the third is the offset of the paired state together with the sel characters required for assembling and distributing data, and the output tensor is the calculation result (output). For the m-th operation bit, the offset equals 2^m; see equation (2). The data type and data size of input_x and output should be consistent. Parameters with the same name have the same meaning across the operators and are not described again here.
The three operators described above have similar operating steps (steps A100 to A400 above); only the methods of carrying and assembling the data differ. Fig. 5 is a schematic diagram of an operator calculation process provided in an embodiment of the present invention; the specific process is as follows:
First, obtain the type of the input data. One half-precision floating point number occupies 2 bytes and one single-precision floating point number occupies 4 bytes. The data-carrying data_move instruction, the vector multiplication vec_mul instruction and the vector addition vec_add instruction used by the invention need different parameters depending on the data type.
Second, obtain the core number and the buffer size. In the prototype definition of for_range (a function interface of TIK), the user achieves core parallelism by setting the parameter block_num, and can obtain the number of AI Cores through the get_soc_spec interface; taking the Ascend 910 processor as an example, one chip contains 32 AI Cores. The Unified Buffer is the target storage space for the inputs and outputs of vector and scalar calculations, and its size is obtained through the get_unified_buffer_size function.
Third, start multiple cores and divide input_x according to the obtained core number. Different Ascend AI processor models have different numbers of AI Cores; for the Ascend 910 processor, one NPU has 32 AI Cores. Specifically, the data stored in the buffer is divided and evenly distributed to the cores, and each core processes only the data allocated to it.
Fourth, on the Ascend AI processor, since the Unified Buffer storage inside the AI Core is separate from the Global Memory storage outside the AI Core, computing data in the AI Core requires that the data first be carried from the Global Memory to the Unified Buffer, then calculated in the AI Core, and finally carried back to the Global Memory. For the Ascend 910 processor, the size of the Unified Buffer is no more than 256 KB, so the state-vector array in Global Memory must be copied into the Unified Buffer in segments. The amount of data per copy is calculated from the obtained buffer size, and the data is carried from the Global Memory to the Unified Buffer using the data_move instruction, where the Unified Buffer requires the offset address to be 32-byte aligned. One float16 and one float32 occupy 2 bytes and 4 bytes of memory respectively, and the shortest contiguous data fragment that data_move can transfer is 16 float16s or 8 float32s. In one application scenario, a total of 2^n floating point numbers are stored in the global memory; when n = 30, the data size is 4 GB. The Unified Buffer is only 248 KB and cannot hold it all, so one part of the data is carried each time, and the next fragment is carried after the current one is calculated. Thus, the amount of data per copy is less than the buffer size.
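The segmentation in this fourth step can be sketched as a copy plan (a simplified model; the 32-byte alignment rule is from the text, while the helper name is hypothetical):

```python
def plan_copies(total_elems, buffer_elems, elem_bytes):
    """Split a Global Memory array into data_move-sized chunks that fit the
    Unified Buffer, rounding the chunk length down to a 32-byte multiple."""
    align = 32 // elem_bytes                  # 16 float16s or 8 float32s
    chunk = (buffer_elems // align) * align   # largest aligned chunk that fits
    plan, start = [], 0
    while start < total_elems:
        length = min(chunk, total_elems - start)
        plan.append((start, length))
        start += length
    return plan
```

For example, 100 float32s through a 30-element buffer yield aligned chunks of 24 elements plus one 4-element tail.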
Fifth, the Vector computing unit of the DaVinci architecture is responsible for executing vector operations; all of its source and destination data must be stored in the output buffer (Unified Buffer) and must be 32-byte aligned. The data carried to the Unified Buffer in the fourth step undergoes the matrix-vector multiplication of the gate operation, as shown in equation (1), through the vec_muls, vec_add and vec_sub instructions.
Sixth, carry the calculation result of the fifth step from the Unified Buffer to the Global Memory using the data_move instruction. If all the data have been calculated, the operator finishes; otherwise, return to the fourth step and process the next data segment.
The distributed single-qubit gate operator, the low-order single-qubit gate operator and the high-order single-qubit gate operator differ mainly in their data-carrying and data-assembling methods. As shown in equation (1), the height of the qubit gate operation bit determines the distance in memory between the two states of a pair.
In this embodiment, the distributed single-qubit gate operator performs data transfer based on the following steps: acquire the data to be calculated, which comprises a local state vector and a paired state vector, where the paired state vector is the state vector paired with the local state vector obtained through distributed communication, and the local state vector and the paired state vector reside on different Ascend AI processors; then cyclically carry the local state vector and the paired state vector based on the length of the local state vector and a preset segment length, where in each cycle data segments at the same position and of the same length in the local and paired state vectors are carried from the global memory to the output buffer for calculation, and after the calculation is finished the calculated data is carried from the output buffer back to the global memory.
The preset segment length can be set and adjusted according to actual requirements. Specifically, in distributed computing, if the operation bit is among the topmost positions, the paired states reside on two different Ascend AI processors, which must exchange data before computing. For example, for a 30-qubit quantum system with the qubits numbered 0 to 29, the state vector can be expressed as an array of total length 2^30. Suppose the quantum circuit simulation uses four (by way of example only) Ascend AI processors, each storing 2^28 complex numbers in its Global Memory. If the operation bit of the quantum gate is 29, then processors 0 and 2, and processors 1 and 3, need to exchange data; the local state vector is denoted input_x, the state vector obtained by communication is denoted input_y, and the CusOneGateMulStateDist operator is then called. Fig. 6 is a schematic diagram of data pairing in the CusOneGateMulStateDist operator according to an embodiment of the present invention. The n-th element of input_x is paired with the n-th element of input_y. In each loop iteration, data fragments of input_x and input_y with the same position and length are copied from Global Memory into the Unified Buffer, yielding two Unified Buffer arrays ub0 and ub1 corresponding to α_i and α_j of formula (1); after the calculation is finished, output is copied from the Unified Buffer back to the original address in Global Memory. The original address is the original address of input_x. In one application scenario, output could also be copied to another preset address to archive the result, but that would occupy more memory; hence this embodiment copies back to the original address to reduce space usage.
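The partner assignment in this example (processors 0 and 2, and 1 and 3, for operation bit 29) follows from the index arithmetic and can be sketched as follows (hypothetical helper):

```python
def comm_partner(rank, m, local_qubits):
    """Partner processor for a distributed gate on qubit m when each
    processor stores 2**local_qubits amplitudes: the partner's rank
    differs only in bit (m - local_qubits)."""
    return rank ^ (1 << (m - local_qubits))
```

For the 30-qubit, 4-processor example, each processor holds 2^28 amplitudes (local_qubits = 28), so a gate on qubit 29 pairs ranks 0 with 2 and 1 with 3.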
The high-order single-qubit gate operator performs data transfer based on the following steps: acquire the data to be calculated, which is a local state vector whose paired states reside on the same Ascend AI processor at a distance not less than the minimum data-carrying dimension of the Ascend AI processor; obtain the length of the contiguously transferred data fragment and the interval between adjacent contiguous fragments; then, based on these two values, load data from different positions of the local state vector in the global memory into the output buffer for calculation, and after the calculation is finished carry the calculated data from the output buffer back to the global memory.
Fig. 7 is a schematic diagram of data pairing in the CusOneGateMulStateHigh operator according to an embodiment of the present invention. If the simulator uses single-precision floating point and the operation bit of the gate is greater than or equal to 3, the distance in memory between two paired states is greater than or equal to 32 bytes, i.e., the minimum size of a data_move operation (for half-precision floating point, the operation bit must be greater than or equal to 4). Set the length of the contiguously transferred data fragment and the interval between adjacent contiguous fragments, call data_move to load data from different positions of the local state-vector array into the Unified Buffer, and after the calculation is finished copy output from the Unified Buffer back to the original address in Global Memory. The length of a contiguously transferred data fragment is the length of a run of contiguous data, and the interval between adjacent contiguous fragments is the number of bytes between two contiguous transfers.
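The strided transfer described here — fragments of a fixed length separated by a fixed gap — can be modeled as follows (a sketch with hypothetical names; on the device this is a single parameterized data_move):

```python
def strided_gather(mem, start, burst_len, gap, n_bursts):
    """Copy n_bursts runs of burst_len contiguous elements from 'mem',
    skipping 'gap' elements between runs (model of a strided data_move)."""
    out = []
    for b in range(n_bursts):
        base = start + b * (burst_len + gap)
        out.extend(mem[base:base + burst_len])
    return out
```

For operation bit m with offset 2^m, gathering with burst_len = gap = 2^m collects all α_i amplitudes into one array, and starting at offset 2^m collects all α_j amplitudes into the other.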
The low-order single-qubit gate operator performs data transfer based on the following steps: acquire the data to be calculated, which is a local state vector whose paired states reside on the same Ascend AI processor at a distance smaller than the minimum data-carrying dimension of the Ascend AI processor; load data into the output buffer starting from element 0 of the local state vector in the global memory to obtain the zeroth array; load data into the output buffer starting from element 1 of the local state vector to obtain the first array; load data into the output buffer starting from element −1 of the local state vector to obtain the second array; compute the upper half of the unitary matrix with the zeroth and first arrays to obtain the third array, and the lower half of the unitary matrix with the second and zeroth arrays to obtain the fourth array; then select between the third and fourth arrays bit by bit to obtain the fifth array, and carry the data in the fifth array from the output buffer back to the global memory.
If the operation bit is among the lowest few positions (e.g., bits 0, 1, 2, 3 for half-precision floating point; bits 0, 1, 2 for single-precision floating point), the distance in memory between the paired states is less than 32 bytes, i.e., the minimum size of a data_move operation. Taking operation bit 0 as an example, the offset of the paired state is 1; with single precision, the distance between paired states in memory is 4 bytes. Element 0 of the local state vector pairs with element 1, element 2 with element 3, element 4 with element 5, and so on. The calculation uses the vector computing units: elements 0, 2, 4, … need to be carried to the ub0 array, and elements 1, 3, 5, … to the ub1 array. Since data_move does not support selective data transfer, and transferring data one element at a time is inefficient, as shown in fig. 8, the invention proposes the following data-carrying method for the CusOneGateMulStateLow operator:
1) Load data into the Unified Buffer starting from element 0 of the state vector; the resulting ub0 (i.e., the zeroth array) contains elements 0, 1, 2, 3, 4, 5, …;
2) Load data into the Unified Buffer starting from element 1 of the state vector; the resulting ub1 (i.e., the first array) contains elements 1, 2, 3, 4, 5, 6, …;
3) Load data into the Unified Buffer starting from element −1 of the state vector; the resulting ub2 (i.e., the second array) contains xx, 0, 1, 2, 3, 4, … (xx denotes useless data). Because Global Memory does not support negative address access, several blank elements are prepended to the array when the Global Memory state vector is created;
4) Substitute the three ub arrays into formula (1): ub0 and ub1 participate in the calculation of the upper half of the matrix, yielding ub3 (i.e., the third array), whose data after calculation are elements 0, xx, 2, xx, 4, xx; ub2 and ub0 participate in the calculation of the lower half of the matrix, yielding ub4 (i.e., the fourth array), whose data after calculation are xx, 1, xx, 3, xx, 5;
5) Call the vec_sel function and select bitwise between ub3 and ub4 using the sel characters (e.g., select from ub3 when the sel bit is 1 and from ub4 when it is 0), obtaining ub5 (i.e., the fifth array), which contains the calculated elements 0, 1, 2, 3, 4, 5, …;
6) Copy the data in ub5 back to the original address in Global Memory.
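Steps 1) to 6) can be checked with a small Python model (ordinary lists stand in for the ub arrays and 0.0 for blank data; the names are illustrative):

```python
def low_bit_gate(v, m2):
    """CusOneGateMulStateLow-style update for operation bit 0: three shifted
    copies of the state vector replace per-element gathers, and a bitwise
    select merges the two partial results."""
    pad = 0.0
    ub0 = v[:]                 # elements 0, 1, 2, 3, ...
    ub1 = v[1:] + [pad]        # elements 1, 2, 3, 4, ...
    ub2 = [pad] + v[:-1]       # xx, 0, 1, 2, ...   (xx = blank data)
    # upper half of the gate matrix: results valid at even positions
    ub3 = [m2[0][0] * a + m2[0][1] * b for a, b in zip(ub0, ub1)]
    # lower half of the gate matrix: results valid at odd positions
    ub4 = [m2[1][0] * a + m2[1][1] * b for a, b in zip(ub2, ub0)]
    # vec_sel-style bitwise select: even index -> ub3, odd index -> ub4
    return [ub3[k] if k % 2 == 0 else ub4[k] for k in range(len(v))]
```

For instance, applying the Pauli-X matrix to [1, 0, 0, 0] yields [0, 1, 0, 0], i.e., the |00> amplitude moves to |01>.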
It should be noted that the gate matrix in formula (1) does not necessarily consist entirely of non-zero elements; for matrices containing zero elements, the calculation flows of the three operators can be simplified by omitting the data-carrying and floating-point operations associated with the zero elements. For a single-gate random quantum circuit, the three operators are selected and called in sequence according to the height of each quantum gate operation bit, thereby realizing quantum circuit simulation.
Building a high-performance quantum circuit simulator on the Ascend AI platform is challenging. The large number of small matrix multiplications used to simulate the various quantum gates leads to poor data locality and, in turn, low utilization of computing resources. To improve data locality, the invention provides a data blocking reuse method for the output buffer (the UB Blocking method), which repeatedly uses the efficient Unified Buffer storage unit and reduces the number of transfers of lower-bandwidth Global Memory data.
The Unified Buffer (UB) stores the inputs and outputs of the vector and scalar computing units. For the Ascend 910 processor, the available size of the Unified Buffer is no more than 256 KB, which can hold no more than 2^16 single-precision floating point numbers. In actual calculation, the UB stores not only the input/output state-vector segments but also temporary intermediate variables, so the state-vector length per transfer is 2^12 single-precision complex numbers.
Fig. 9 is a schematic flowchart of the output-buffer data blocking reuse method according to an embodiment of the present invention. Specifically, consider a quantum circuit of n qubits and a gate group comprising several gates, all of which operate only on the lowest k (0 < k < n) qubits. A complex-number set of length 2^k can then be extracted from the state vector such that applying the gate operations of the group involves only the data within that set, and the whole state vector can be divided into 2^(n−k) such mutually independent complex-number sets. Each set of length 2^k is carried into the Unified Buffer, all gates of the group are applied in one pass, and the set is carried back to Global Memory after calculation. Compared with the conventional calculation process, the UB Blocking method reduces the number of data transfers between the Global Memory and the Unified Buffer. For example, with the conventional method each gate requires carrying 2^n complex numbers, so a gate group containing m gates transfers a total of m × 2^n complex numbers. With the proposed method the data is carried per complex-number set, and the total carried data volume drops to 2^n complex numbers, i.e., 1/m of the original, thereby reducing the calculation time.
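The UB Blocking idea can be sketched as follows (a simplified single-process model; each (gate, m) entry of the group is a 2x2 matrix acting on qubit m < k, and a list slice stands in for the Unified Buffer):

```python
def apply_gate_group_blocked(state, gates, k):
    """UB Blocking sketch: every gate in 'gates' acts only on qubits
    0..k-1, so each contiguous block of 2**k amplitudes is independent;
    carry one block in, apply all gates of the group, carry it back."""
    block = 1 << k
    for base in range(0, len(state), block):
        ub = state[base:base + block]          # one GM -> UB transfer
        for gate, m in gates:                  # all gates of the group
            offset = 1 << m
            for i in range(block):
                if not (i & offset):           # i is the alpha_i of the pair
                    a, b = ub[i], ub[i + offset]
                    ub[i] = gate[0][0] * a + gate[0][1] * b
                    ub[i + offset] = gate[1][0] * a + gate[1][1] * b
        state[base:base + block] = ub          # one UB -> GM transfer
```

A group of m gates thus costs one round trip per block instead of m, matching the 1/m reduction described above.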
Further, due to the size limitation of the Unified Buffer, for single-precision floating point a gate group can cover at most 12 quantum operation bits; and since the minimum operation size of the data_move instruction is 32 bytes, the gate group must include operation bits 0 to 3, while the other 8 operation bits may be arbitrary. A quantum circuit can be converted into gate groups, and in some cases the order of gates can be exchanged, which helps reduce the number of gate groups.
It should be noted that the idea behind the data-carrying and calculation methods of the qubit gate operators also applies to the implementation of controlled single-qubit gates and two-qubit gates.
The invention provides a general implementation and optimization method for the single-gate operators of a quantum simulator on the Ascend AI processor, which exploits the architectural and performance advantages of the Ascend AI processor by combining the low-level TIK development mode with optimizations such as multi-core execution and the Unified Buffer. The following table compares the performance of the Ascend quantum simulator with similar open-source GPU simulators; all calculations and simulations use single-precision floating point numbers except for the Qulacs simulator. The second column gives the operator run time of the Ascend quantum simulator and the gate-operation kernel time of the comparable GPU simulators. The simulators run a 29-qubit random quantum circuit of U1 single gates; the Ascend quantum simulator uses an Ascend 910 machine, while QuEST and qcgpu use an NVIDIA Tesla V100 GPU, each simulator being tested on a single card. The test performance results are as follows:
[Table: operator/kernel run times of the Ascend quantum simulator versus the QuEST, qcgpu and Qulacs GPU simulators; values not reproduced in this text.]
It can be seen that the performance of the Ascend quantum simulator is several times that of comparable GPU simulators. Targeting the characteristics and programming model of the DaVinci architecture of the Ascend AI processor, and combining them with the characteristics of quantum simulation programs, the invention provides the data-transfer and computation details of a quantum simulator that can run on the Ascend AI processor, further proposes the UB Blocking method, and uses the multi-core optimization of the Ascend AI processor to fully utilize the AI Core computing resources. It offers a general implementation and optimization method for developers who develop, port or optimize quantum simulator programs for the Ascend AI processor, applicable not only to the single-qubit gates described in the present invention but also to controlled single-qubit gates and two-qubit gates. The quantum simulator of the invention for the Ascend AI processor has been fully tested and used; on the Ascend 910 processor, the operators provided by the invention are encapsulated in two ways: as custom operators added through MindSpore, and through application development with AscendCL. The quantum simulator has completed the implementation and testing of single-qubit gates, and its efficiency has been compared with other GPU quantum simulators of the same type; as the table shows, the invention is feasible and its effect is significant.
It should be noted that the Ascend AI chip has three basic computing resources: the matrix computing unit (Cube Unit), the vector computing unit (Vector Unit), and the scalar computing unit (Scalar Unit). The invention uses the vector computing unit of the Ascend AI chip to perform the quantum circuit simulation. The matrix computing unit can multiply two half-precision floating-point matrices of size 16 × 16. If a gate group operates on no more than 4 qubits, the group can be fused into one 4-qubit gate represented by a 16 × 16 matrix, and multiplying this matrix by the corresponding vector composed of 16 states yields the updated vector. These matrix-vector multiplications come in many independent groups with no data dependency, so 16 groups of vectors can form a 16 × 16 matrix, and the quantum circuit simulation can then be completed using the matrix computing unit. The idea of the operation is as follows:
1) Analyze the quantum circuit, rearrange the order of the quantum gates, and divide them into a number of gate groups, each containing several gates and operating on no more than 4 qubits. Each gate group can be represented by a 16 × 16 matrix.
2) Assuming a gate group operates on bits i, j, k, l, the 15 states in the state vector paired with element number 0 are at offsets 2^i, 2^j, 2^i+2^j, 2^k, 2^k+2^i, 2^k+2^j, 2^k+2^i+2^j, ... (all nonzero sums of subsets of {2^i, 2^j, 2^k, 2^l}). Together with element 0 they form a vector of 16 states, which is multiplied by the 16 × 16 gate matrix.
3) The matrix computing unit only supports multiplying two 16 × 16 matrices and cannot directly perform the matrix-vector product described in 2). Instead, 16 groups of paired states can be packed at once into a 16 × 16 state matrix, which can then be multiplied by the gate matrix. If the qubits the gate group operates on are all greater than or equal to 4, a data-transfer method similar to the CusOneGateMulStatesHigh operator can be used: elements 0, 1, 2, ..., 15 are placed in the first row of the state matrix, elements 2^i, 2^i+1, 2^i+2, ..., 2^i+15 in the second row, and so on. If any qubit the gate group operates on is less than 4, a data-transfer method similar to the CusOneGateMulStatesLow operator can be used, employing data-transfer instructions with an offset, with a select instruction finally extracting the result.
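Steps 2) and 3) above can be sketched in NumPy: enumerate the paired-state offsets for a fused gate on bits i, j, k, l, pack 16 independent groups into a 16 × 16 state matrix, and update them with a single 16 × 16 matrix multiply. This only models the idea of the Cube-unit packing; `pair_offsets`, `apply_fused_gate_cube_style`, and the layout choices are illustrative assumptions, not the patent's operators:

```python
import numpy as np

def pair_offsets(bits):
    """Offsets, relative to a base index, of the 2**len(bits) amplitudes
    a fused gate on qubits `bits` acts on: all sums of subsets of
    {2^i, 2^j, ...}, i.e. 0, 2^i, 2^j, 2^i+2^j, 2^k, ... as in the text."""
    return [sum(1 << b for t, b in enumerate(bits) if mask >> t & 1)
            for mask in range(1 << len(bits))]

def apply_fused_gate_cube_style(state, gate16, bits, bases):
    """Pack 16 groups of paired amplitudes into a 16x16 state matrix and
    update them with one 16x16 matrix multiply, mimicking how the Cube
    unit's 16x16 matmul could be used. `bases` gives the base index of
    each of the 16 independent groups."""
    offs = pair_offsets(bits)                     # 16 offsets per group
    S = np.empty((16, 16), dtype=state.dtype)
    for col, base in enumerate(bases):            # gather: one group per column
        S[:, col] = state[[base + o for o in offs]]
    S = gate16 @ S                                # a single 16x16 x 16x16 product
    for col, base in enumerate(bases):            # scatter the updated states back
        state[[base + o for o in offs]] = S[:, col]
    return state
```

With a 4-qubit fused Hadamard gate (the Kronecker product of four 2 × 2 Hadamards) applied to the all-zeros state, each of the first 16 amplitudes becomes 1/16-probability, as expected.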
As shown in fig. 10, corresponding to the quantum simulator implementation method, an embodiment of the present invention further provides a quantum simulator implementation apparatus, where the quantum simulator implementation apparatus includes:
An operator constructing module 410, configured to construct a qubit gate operator, where the qubit gate operator is used to carry data in the target device and perform a state vector update based on an operation bit of the qubit gate.
In this embodiment, the qubit gate operator is a function that can be compiled and run in a target device. The qubit gate operator is used to perform a single-qubit gate operation on a qubit.
A quantum simulator constructing module 420, configured to construct a quantum simulator, where the quantum simulator includes the qubit gate operator, and the quantum simulator is a software package for implementing quantum circuit simulation.
Specifically, a software package that can be deployed, compiled, and executed in the target device is generated by packaging the qubit gate operator, and serves as the quantum simulator. The quantum simulator includes the qubit gate operator and other data, such as functions and instructions that support normal use of the qubit gate operator in the target device; the specific data may be set and adjusted according to actual requirements and is not specifically limited herein.
A deployment module 430, configured to deploy the quantum simulator to the target device, the target device being an Ascend AI processor.
Specifically, the quantum simulator is compiled and run in the target device, so that the target device can call the qubit gate operator to carry out data transfer and state vector updates, thereby realizing the simulation of the quantum circuit. In this embodiment, the target device is an Ascend AI processor. In practical use, the target device may also be another processor or device, which is not specifically limited herein.
It should be noted that, the specific functions corresponding to the quantum simulator implementation apparatus and the specific modules thereof may be set and adjusted by referring to the quantum simulator implementation method, and are not described herein again.
As shown in fig. 11, corresponding to the quantum simulator implementation method, an embodiment of the present invention further provides a quantum simulation method, where the method includes:
Step B100: acquiring data to be calculated.
Step B200: calling the qubit gate operator to carry the data to be calculated and update the state vector.
It should be noted that, a specific quantum simulation process may refer to a specific calculation flow in the quantum simulator implementation method, and details are not described herein again.
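As a rough illustration of this flow — acquire the data, then have the operator process the state vector segment by segment, carrying each segment into a buffer, computing, and carrying the result back — the following NumPy sketch applies a single-qubit gate in fixed-length segments. The segmented copy stands in for the global-memory-to-buffer transfers; `blocked_apply`, its alignment assumption, and the buffer model are illustrative assumptions, not the Ascend kernel:

```python
import numpy as np

def blocked_apply(state, gate, target, seg_len=1024):
    """Apply a 2x2 `gate` to qubit `target`, processing `state` in segments.

    Each iteration copies one segment out (standing in for carrying data
    from global memory to the on-chip buffer), updates the paired
    amplitudes inside it, and writes it back. Assumes seg_len is a
    multiple of 2 * 2**target so every segment holds complete pairs.
    """
    n = state.shape[0]
    stride = 1 << target
    seg_len = min(seg_len, n)
    assert seg_len % (2 * stride) == 0, "segment must hold whole amplitude pairs"
    for start in range(0, n, seg_len):
        seg = state[start:start + seg_len].copy()   # "carry in" to the buffer
        pairs = seg.reshape(-1, 2, stride)          # group paired amplitudes
        a0 = pairs[:, 0, :].copy()
        a1 = pairs[:, 1, :].copy()
        pairs[:, 0, :] = gate[0, 0] * a0 + gate[0, 1] * a1
        pairs[:, 1, :] = gate[1, 0] * a0 + gate[1, 1] * a1
        state[start:start + seg_len] = seg          # "carry back" to global memory
    return state
```

In the patent's setting the segments would additionally be distributed across AICores; here a single loop stands in for that multi-core split.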
Based on the above embodiments, the present invention further provides an intelligent terminal, a schematic block diagram of which may be as shown in fig. 12. The intelligent terminal includes a processor, a memory, a network interface, and a display screen connected through a system bus. The processor of the intelligent terminal provides computing and control capability. The memory of the intelligent terminal includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a quantum simulator implementation program. The internal memory provides an environment for the running of the operating system and the quantum simulator implementation program stored in the non-volatile storage medium. The network interface of the intelligent terminal is used to connect to and communicate with an external terminal through a network. The quantum simulator implementation program, when executed by the processor, implements the steps of any one of the quantum simulator implementation methods. The display screen of the intelligent terminal may be a liquid crystal display or an electronic ink display.
It will be understood by those skilled in the art that the block diagram of fig. 12 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the intelligent terminal to which the solution of the present invention is applied, and a specific intelligent terminal may include more or less components than those shown in the figure, or combine some components, or have different arrangements of components.
In one embodiment, an intelligent terminal is provided, where the intelligent terminal includes a memory, a processor, and a quantum simulator implementation program stored in the memory and executable on the processor, and the quantum simulator implementation program performs the following operation instructions when executed by the processor:
constructing a qubit gate operator, wherein the qubit gate operator is used for carrying data in a target device and updating a state vector based on an operation bit of a quantum gate;
constructing a quantum simulator, wherein the quantum simulator comprises the qubit gate operator, and the quantum simulator is a software package for realizing quantum circuit simulation;
deploying the quantum simulator into the target device, wherein the target device is an Ascend AI processor.
The embodiment of the present invention further provides a computer-readable storage medium, where a quantum simulator implementation program is stored on the computer-readable storage medium, and when the quantum simulator implementation program is executed by a processor, the quantum simulator implementation program implements the steps of any one of the quantum simulator implementation methods provided in the embodiments of the present invention.
Optionally, the memory of the intelligent terminal and the computer-readable storage medium may also store a quantum simulator implementation program for implementing the steps of the quantum simulator implementation method.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art would appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the above modules or units is only one logical division, and the actual implementation may be implemented by another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The integrated modules/units described above, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the contents contained in the computer-readable storage medium may be increased or decreased as required by legislation and patent practice in the jurisdiction.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be construed as included therein.

Claims (13)

1. A method for implementing a quantum simulator, the method comprising:
constructing a qubit gate operator, wherein the qubit gate operator is used for carrying data in a target device and updating a state vector based on an operation bit of a quantum gate;
constructing a quantum simulator, wherein the quantum simulator comprises the qubit gate operator, and the quantum simulator is a software package for realizing quantum circuit simulation;
deploying the quantum simulator into the target device, the target device being an Ascend AI processor.
2. The quantum simulator implementation of claim 1, wherein the qubit gate operator is configured to perform a single qubit gate operation on a qubit.
3. The quantum simulator implementation of claim 2, wherein the qubit gate operator performs quantum simulation based on the steps of:
obtaining the type of data to be calculated and the number of cores and buffer size of the Ascend AI processor;
starting multi-core processing based on the number of cores and dividing the state vector corresponding to the data to be calculated;
carrying the divided state vectors from the global memory to the output buffer area, and then performing the matrix-vector multiplication corresponding to the qubit gate operation to obtain calculated data;
and carrying the calculated data from the output buffer area to the global memory.
4. The quantum simulator implementation of claim 3, wherein the qubit gate operators comprise a distributed single-qubit gate operator, a low-order single-qubit gate operator, and a high-order single-qubit gate operator, each corresponding to a different operation bit.
5. The quantum simulator implementation of claim 4, wherein the constructing the qubit gate operator comprises:
acquiring a distributed high operation bit range, a low operation bit range, and a non-distributed high operation bit range based on the quantum circuit and the target device, wherein the operation bits in the distributed high operation bit range are operation bits that require distributed communication, the address distance corresponding to the operation bits in the low operation bit range is smaller than the minimum data handling scale of the Ascend AI processor, and the address distance corresponding to the operation bits in the non-distributed high operation bit range is not smaller than the minimum data handling scale of the Ascend AI processor and involves no distributed communication;
and respectively constructing a distributed single-quantum-bit gate operator corresponding to the distributed high-operation bit range, a low-order single-quantum-bit gate operator corresponding to the low-operation bit range and a high-order single-quantum-bit gate operator corresponding to the non-distributed high-operation bit range.
6. The quantum simulator implementation of claim 5, wherein the distributed single qubit gate operator performs data handling based on:
acquiring data to be calculated, wherein the data to be calculated comprises a local state vector and a paired state vector, the paired state vector is a state vector paired with the local state vector and obtained through distributed communication, and the local state vector and the paired state vector are located in different Ascend AI processors;
and cyclically carrying the local state vector and the paired state vector based on the length of the local state vector and a preset segment length, carrying data segments with the same position and the same length in the local state vector and the paired state vector from the global memory to the output buffer area for calculation in each cycle, and carrying the calculated data from the output buffer area to the global memory after the calculation is finished.
7. The quantum simulator implementation of claim 5, wherein the high order single qubit gate operator performs data handling based on the steps of:
acquiring data to be calculated, wherein the data to be calculated is a local state vector, and the local state vector comprises two paired state vectors that are located in the same Ascend AI processor and whose distance is not smaller than the minimum data handling scale of the Ascend AI processor;
acquiring the length of a continuous transmission data segment and the interval of adjacent continuous data segments;
and loading data from different positions of the local state vector in a global memory to the output buffer area for calculation based on the length of the continuous transmission data segment and the interval of the adjacent continuous data segments, and carrying the calculated data from the output buffer area to the global memory after the calculation is finished.
8. The quantum simulator implementation of claim 5, wherein the low order single qubit gate operator performs data handling based on the steps of:
acquiring data to be calculated, wherein the data to be calculated is a local state vector, and the local state vector comprises two paired state vectors that are located in the same Ascend AI processor and whose distance is smaller than the minimum data handling scale of the Ascend AI processor;
loading data into the output buffer area starting from element number 0 of the local state vector in the global memory to obtain a zeroth array;
loading data into the output buffer area starting from element number 1 of the local state vector in the global memory to obtain a first array;
loading data into the output buffer area starting from element number -1 of the local state vector in the global memory to obtain a second array;
and after calculating with the upper half of the unitary matrix using the zeroth array and the first array to obtain a third array, and calculating with the lower half of the unitary matrix using the zeroth array and the second array to obtain a fourth array, performing bitwise selection on the third array and the fourth array to obtain a fifth array, and carrying the data in the fifth array from the output buffer area to the global memory.
9. The quantum simulator implementation method of claim 1, wherein the constructing the quantum simulator, wherein the quantum simulator comprises the qubit gate operator and the quantum simulator is a software package for realizing quantum circuit simulation, comprises:
generating, by packaging the qubit gate operator, a software package that can be deployed, compiled, and executed in the Ascend AI processor, as the quantum simulator.
10. A quantum simulator implementation apparatus, the apparatus comprising:
an operator construction module, configured to construct a qubit gate operator, wherein the qubit gate operator is used for carrying data in a target device and updating a state vector based on an operation bit of a quantum gate;
a quantum simulator building module, configured to build a quantum simulator, wherein the quantum simulator comprises the qubit gate operator, and the quantum simulator is a software package for realizing quantum circuit simulation;
and a deployment module, configured to deploy the quantum simulator into the target device, the target device being an Ascend AI processor.
11. A quantum simulation method applied to an Ascend AI processor in which the quantum simulator of any one of claims 1 to 9 is deployed, the method comprising:
acquiring data to be calculated;
and calling the quantum bit gate operator to carry the data to be calculated and update the state vector.
12. An intelligent terminal, characterized in that the intelligent terminal comprises a memory, a processor and a quantum simulator implementation program stored on the memory and executable on the processor, the quantum simulator implementation program, when executed by the processor, implementing the steps of the quantum simulator implementation method as claimed in any one of claims 1 to 9.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a quantum simulator implementation program, which, when executed by a processor, implements the steps of the quantum simulator implementation method as claimed in any one of claims 1 to 9.
CN202111040089.0A 2021-09-06 2021-09-06 Quantum simulator implementation method and device, related equipment and quantum simulation method Pending CN113887730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111040089.0A CN113887730A (en) 2021-09-06 2021-09-06 Quantum simulator implementation method and device, related equipment and quantum simulation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111040089.0A CN113887730A (en) 2021-09-06 2021-09-06 Quantum simulator implementation method and device, related equipment and quantum simulation method

Publications (1)

Publication Number Publication Date
CN113887730A true CN113887730A (en) 2022-01-04

Family

ID=79008353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111040089.0A Pending CN113887730A (en) 2021-09-06 2021-09-06 Quantum simulator implementation method and device, related equipment and quantum simulation method

Country Status (1)

Country Link
CN (1) CN113887730A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236457A (en) * 2023-11-13 2023-12-15 国开启科量子技术(安徽)有限公司 Method, system and electronic device for operating and using quantum simulator


Similar Documents

Publication Publication Date Title
EP4036724A1 (en) Method for splitting neural network model by using multi-core processor, and related product
US11568258B2 (en) Operation method
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
EP3888012A1 (en) Adjusting precision and topology parameters for neural network training based on a performance metric
US20210216318A1 (en) Vector Processor Architectures
EP3877913A1 (en) Training neural network accelerators using mixed precision data formats
US9704094B2 (en) Mapping of algorithms to neurosynaptic hardware
US20220121903A1 (en) Method of performing splitting in neural network model by means of multi-core processor, and related product
US11768911B2 (en) Method and apparatus for execution of neural network
CN112236784A (en) Modifying machine learning models to improve locality
CN112734040A (en) Embedded artificial intelligence computing framework and application method
JP7246447B2 (en) Model training method, apparatus, electronic device, storage medium, development system and program
US20240176845A1 (en) Method and device for matrix multiplication optimization using vector registers
WO2021061329A1 (en) Apparatus and system for execution of neural network
WO2023134453A1 (en) Operator processing method and computer device
US20200356836A1 (en) Fast deep learning fully-connected column-major implementation
CN113885941A (en) Singular value decomposition operation implementation method, device and related equipment
CN113887730A (en) Quantum simulator implementation method and device, related equipment and quantum simulation method
Hosseiny et al. Hardware acceleration of YOLOv7-tiny using high-level synthesis tools
CN111931939A (en) Single-amplitude quantum computation simulation method
CN114327630B (en) High-performance operator generation method suitable for Huaji Shengteng chip
WO2019084254A1 (en) Tensor manipulation within a neural network
US20240184554A1 (en) Vectorizing a loop
KR102576762B1 (en) A Processing-In-Memory Accelerator for End-to-End On-Device Training
US12008469B1 (en) Acceleration of neural networks with stacks of convolutional layers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination