I Measurement-based Adaptation Protocol with Quantum Reinforcement Learning
We review the protocol of Ref. [1], which we subsequently implement on the Rigetti cloud quantum computer. The aim of this algorithm is to adapt a quantum state to an unknown reference state via successive measurements. The protocol requires many identical copies of the reference state: each measurement destroys the copy it acts on, but yields more information about the state. The system that we consider is composed of the following parts:

The environment system, E: contains the reference state copies.

The register, R: interacts with E and obtains information from it.

The agent, A: it is adapted by digital feedback depending on the outcome of the measurement of the register.
Let us assume that we know the state of a quantum system called the agent, and that many copies of an unknown quantum state called the environment are provided. Let us also consider an auxiliary system called the register, which interacts with E. We extract information from E by measuring R and use the result as the input of a reward function (RF).
Subsequently, we perform a partially random unitary transformation on A, which depends on the output of the RF.
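Let us present the simplest case, in which each subsystem is described by a qubit state: agent $|0\rangle_A$; register $|0\rangle_R$; and environment $|E\rangle_E = \cos(\theta^{(1)}/2)|0\rangle_E + e^{i\phi^{(1)}}\sin(\theta^{(1)}/2)|1\rangle_E$.
Therefore, the initial state is,
(1) $|\psi_0\rangle = |0\rangle_A |0\rangle_R \left[\cos(\theta^{(1)}/2)|0\rangle_E + e^{i\phi^{(1)}}\sin(\theta^{(1)}/2)|1\rangle_E\right]$
Then, we apply a CNOT gate with E as control and R as target (policy), in order to obtain information from E, namely
(2) $|\psi_1\rangle = U_{\rm CNOT}|\psi_0\rangle = |0\rangle_A \left[\cos(\theta^{(1)}/2)|0\rangle_R|0\rangle_E + e^{i\phi^{(1)}}\sin(\theta^{(1)}/2)|1\rangle_R|1\rangle_E\right]$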
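As a quick numerical check of the policy step, the following minimal sketch applies a CNOT (environment as control, register as target) to a two-qubit amplitude table and reads off the register statistics. It is a classical illustration only, not the code run on the Rigetti device; the function names and the test angles are hypothetical, and the environment is assumed to be parametrized as $\cos(\theta/2)|0\rangle + e^{i\phi}\sin(\theta/2)|1\rangle$.

```python
import cmath
import math


def environment_state(theta, phi):
    """Amplitudes (a, b) of cos(theta/2)|0> + exp(i*phi) sin(theta/2)|1>."""
    return math.cos(theta / 2), cmath.exp(1j * phi) * math.sin(theta / 2)


def cnot_env_controls_register(amplitudes):
    """Flip the register bit r whenever the environment bit e is 1.

    `amplitudes` maps (r, e) basis labels to complex amplitudes.
    """
    return {(r ^ e, e): amp for (r, e), amp in amplitudes.items()}


def register_probabilities(theta, phi):
    """Return (P0, P1) for measuring the register after the CNOT."""
    a, b = environment_state(theta, phi)
    # The register starts in |0>, so only the labels (0, 0) and (0, 1) are populated.
    after = cnot_env_controls_register({(0, 0): a, (0, 1): b})
    probs = [0.0, 0.0]
    for (r, _e), amp in after.items():
        probs[r] += abs(amp) ** 2
    return tuple(probs)


# Illustrative angles: theta = pi/3 gives P0 = cos^2(pi/6) and P1 = sin^2(pi/6).
p0, p1 = register_probabilities(math.pi / 3, 0.7)
```

The phase $\phi$ drops out of the register statistics, which is why a single measurement setting only constrains the polar angle of the environment relative to the agent.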
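Secondly, we measure the register qubit in the basis $\{|0\rangle, |1\rangle\}$, with probability $P_0^{(1)} = \cos^2(\theta^{(1)}/2)$ or $P_1^{(1)} = \sin^2(\theta^{(1)}/2)$ of obtaining the state $|0\rangle$ or $|1\rangle$, respectively. Depending on the result of the measurement, we either do nothing if the result is $|0\rangle$, since it means that we collapse E into A, or perform a partially random unitary operator on A if the result is $|1\rangle$. This unitary transformation (action) is given by
(3) $U_A^{(1)}(\alpha^{(1)}, \beta^{(1)}) = e^{-i S_A^z \alpha^{(1)}} e^{-i S_A^x \beta^{(1)}}$
where $S_A^k = \sigma_A^k/2$ is the $k$-th spin component; $\alpha^{(1)} = \xi_\alpha \Delta^{(1)}$ and $\beta^{(1)} = \xi_\beta \Delta^{(1)}$ are random angles and $\xi_\alpha, \xi_\beta$ are random numbers. The range of the random numbers is $\xi_\alpha, \xi_\beta \in [-1/2, 1/2]$, so that the angles lie within the exploration range $\Delta^{(1)}$.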
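To make the action concrete, here is a minimal sketch of how such a partially random unitary could be sampled. It assumes the action has the form $U(\alpha, \beta) = e^{-i S_z \alpha} e^{-i S_x \beta}$ with $S_k = \sigma_k/2$, and that both angles are drawn uniformly from $[-\Delta/2, \Delta/2]$; the function names are hypothetical and the value $\Delta = 4\pi$ is only an illustrative choice.

```python
import cmath
import math
import random


def action_unitary(alpha, beta):
    """Return exp(-i S_z alpha) exp(-i S_x beta) as a 2x2 nested list (S_k = sigma_k / 2)."""
    c, s = math.cos(beta / 2), math.sin(beta / 2)
    em, ep = cmath.exp(-0.5j * alpha), cmath.exp(0.5j * alpha)
    # diag(em, ep) multiplied by [[c, -i s], [-i s, c]]
    return [[em * c, -1j * em * s],
            [-1j * ep * s, ep * c]]


def random_action(delta, rng=random):
    """Draw xi_alpha, xi_beta uniformly in [-1/2, 1/2] and scale them by delta."""
    xi_alpha = rng.uniform(-0.5, 0.5)
    xi_beta = rng.uniform(-0.5, 0.5)
    return action_unitary(xi_alpha * delta, xi_beta * delta)


def is_unitary(u, tol=1e-12):
    """Check that U^dagger U equals the identity within a tolerance."""
    for i in range(2):
        for j in range(2):
            entry = sum(u[k][i].conjugate() * u[k][j] for k in range(2))
            if abs(entry - (1 if i == j else 0)) > tol:
                return False
    return True


# Illustrative exploration range and a seeded generator for reproducibility.
u = random_action(4 * math.pi, random.Random(1))
```

A smaller $\Delta$ confines the sampled rotation to a neighbourhood of the identity, which is how the exploration range controls how far the agent can move in one step.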
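Now, we initialize the register qubit state and use a new copy of the environment, obtaining the following initial state for the second iteration,
(4) $|\psi_0\rangle^{(2)} = \mathcal{U}_A^{(2)} |0\rangle_A |0\rangle_R |E\rangle_E = |\bar{0}\rangle_A^{(2)} |0\rangle_R |E\rangle_E$
with $\mathcal{U}_A^{(2)} = m^{(1)} U_A^{(1)}(\alpha^{(1)}, \beta^{(1)}) + (1 - m^{(1)}) \mathbb{I}_A$, where $m^{(1)} \in \{0, 1\}$ is the outcome of the measurement, and we call $|\bar{0}\rangle_A^{(2)} = \mathcal{U}_A^{(2)} |0\rangle_A$.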
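Later on, we define the RF as
(5) $\Delta^{(k)} = \left[(1 - m^{(k-1)}) R + m^{(k-1)} P\right] \Delta^{(k-1)}$
As we can see, the exploration range of the $k$-th iteration, $\Delta^{(k)}$, becomes $R\Delta^{(k-1)}$ when the outcome of the $(k-1)$-th iteration was $m^{(k-1)} = 0$, and $P\Delta^{(k-1)}$ when it was $m^{(k-1)} = 1$. Here we have chosen $R = \epsilon < 1$, $P = 1/\epsilon > 1$, with $\epsilon$ being a constant.
We define the value function (VF) in this protocol as the value of $\Delta^{(n)}$ after many iterations, assuming that $\Delta^{(n)} \to 0$, i.e., that the agent converges to the environment state.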
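The reward function is a one-line multiplicative update, which the sketch below illustrates. It assumes the rule "multiply the exploration range by $\epsilon$ after outcome 0 and by $1/\epsilon$ after outcome 1"; the function name, the starting value $4\pi$ and $\epsilon = 0.9$ are illustrative assumptions, not values taken from the text.

```python
import math


def update_exploration_range(delta, m, eps):
    """Shrink delta by eps after outcome m = 0 (reward), grow it by 1/eps after m = 1."""
    if m not in (0, 1):
        raise ValueError("measurement outcome must be 0 or 1")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    reward, punishment = eps, 1.0 / eps
    return ((1 - m) * reward + m * punishment) * delta


# One punishment followed by three rewards (illustrative outcome record).
delta = 4 * math.pi
for outcome in [1, 0, 0, 0]:
    delta = update_exploration_range(delta, outcome, eps=0.9)
```

Because reward and punishment are reciprocal, one punishment exactly cancels one reward: here the net effect of the four outcomes is a factor $\epsilon^2$, so the range only shrinks when outcomes 0 outnumber outcomes 1.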
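In the $k$-th iteration of the protocol, once the new copy of the environment has been rotated by $\mathcal{U}^{(k)\dagger}$, the system is in the following state,
(6) $|\psi^{(k)}\rangle = |\bar{0}\rangle_A^{(k)} |0\rangle_R |\bar{E}\rangle_E^{(k)}$
where $|\bar{0}\rangle_A^{(k)} = \mathcal{U}^{(k)}|0\rangle_A$, $|\bar{E}\rangle_E^{(k)} = \mathcal{U}^{(k)\dagger}|E\rangle_E$, and the accumulated rotation operator is,
(7) $\mathcal{U}^{(k)} = \left[m^{(k-1)} U^{(k-1)}(\alpha^{(k-1)}, \beta^{(k-1)}) + (1 - m^{(k-1)}) \mathbb{I}\right] \mathcal{U}^{(k-1)}$
with $\mathcal{U}^{(1)} = \mathbb{I}$.
Then, we perform the gate $U_{\rm CNOT}$,
(8) $|\Phi^{(k)}\rangle = U_{\rm CNOT} |\bar{0}\rangle_A^{(k)} |0\rangle_R |\bar{E}\rangle_E^{(k)} = |\bar{0}\rangle_A^{(k)} \left[\cos(\theta^{(k)}/2)|0\rangle_R|0\rangle_E + e^{i\phi^{(k)}}\sin(\theta^{(k)}/2)|1\rangle_R|1\rangle_E\right]$
where $\theta^{(k)}$ and $\phi^{(k)}$ are the Bloch angles of $|\bar{E}\rangle_E^{(k)}$, and measure R, with probabilities $P_0^{(k)} = \cos^2(\theta^{(k)}/2)$ and $P_1^{(k)} = \sin^2(\theta^{(k)}/2)$ for the outcomes $m^{(k)} = 0$ and $m^{(k)} = 1$, respectively. Finally, we update the reward function, $\Delta^{(k+1)}$, for the next iteration.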
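The role of the accumulated rotation can be illustrated with a small sketch: rotating the environment copy by the inverse of the accumulated operator and then reading the register is equivalent to asking whether the environment is found in the current agent state. The code below assumes this reading of the protocol; the helper names are hypothetical, and the example matrix $e^{-i\pi S_x}$ is chosen only because it maps $|0\rangle$ to $-i|1\rangle$.

```python
def dagger(u):
    """Conjugate transpose of a 2x2 matrix given as nested lists."""
    return [[complex(u[j][i]).conjugate() for j in range(2)] for i in range(2)]


def apply(u, state):
    """Apply a 2x2 matrix to a two-component state vector."""
    return [u[0][0] * state[0] + u[0][1] * state[1],
            u[1][0] * state[0] + u[1][1] * state[1]]


def register_probabilities(u_acc, env):
    """(P0, P1) of the register when the environment is first rotated by u_acc^dagger."""
    rotated = apply(dagger(u_acc), env)
    return abs(rotated[0]) ** 2, abs(rotated[1]) ** 2


identity = [[1, 0], [0, 1]]
flip = [[0, -1j], [-1j, 0]]      # exp(-i * pi * S_x), maps |0> to -i|1>
env_one = [0, 1]                 # environment prepared in |1>

before = register_probabilities(identity, env_one)   # agent still in |0>
after = register_probabilities(flip, env_one)        # agent rotated onto |1>
```

With the agent in $|0\rangle$ the register always reads 1 (punishment); once the accumulated rotation has carried the agent onto the environment state, it always reads 0 (reward).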
II Experimental setup: Rigetti Forest Cloud Quantum Computer
Cloud quantum computers are dedicated quantum processors operated by users who log in through the internet (cloud). Although D-Wave Systems, Inc. [29] was among the first companies to commercialize quantum computers, in 2011, it was not until the arrival of the IBM Quantum Experience [30] in May 2016 that a quantum computer became openly available in the cloud. Approximately one year later, in June 2017, the Californian company Rigetti Computing [31] announced the availability of a cloud quantum computing platform. This last platform is the one that we used in this work, because of the convenience of its web interface for implementing our protocol with feedback. In the past year, Rigetti built a new 8-qubit processor called 8Q-Agave: this is the chip we employed. An advantage of this device is that one does not have to adapt the algorithm to the topology of the system, since the compiler does it for us. In particular, it is the possibility of defining quantum gates in matrix form that makes the adaptation significantly simpler. Among other features, Rigetti Forest offers the possibility of running the experiments on its quantum virtual machine (QVM).
II.1 Python-implemented algorithm
In this section, we explain how the algorithm of Section I is adapted for implementation on the Rigetti simulator and quantum processor. Firstly, we must initialize some variables and constants in order to correctly perform the first iteration:

Reward and punishment ratios: $R = \epsilon$ and $P = 1/\epsilon$.

Exploration range: .

The unitary transformation matrices: .

Partially random unitary operator: .

Initial values of the random angles: . Makes for the first iteration.

Initial value of the iteration index: $k = 1$.

Number of iterations: $N$.
The algorithm is composed of the following steps:

Step 1: While $k \le N$, go to step 2.

Step 2: If
(9) (10) (11) (12) (13) 
Step 3: First quantum algorithm.
First, we define the agent, environment and register qubits as,(14) and act upon the environment,
(15) Then, we have
(16) We apply the policy
(17) and measure the register qubit storing the result in .

Step 4: Second quantum algorithm.
Subsequently, we act with the accumulated unitary on the agent qubit in order to bring it closer to the environment state: (18) Afterwards, we measure this qubit and store the result in a classical register array. We repeat step 4 a total of 8192 times in order to determine the state created after applying the unitary.

Step 5: In this last step, we apply the reward function,
(19) and then increase the iteration index by one: $k \to k + 1$. Go to step 1.
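The steps above can be sketched as a purely classical simulation of the feedback loop. This is not the code that was run on the Rigetti device or QVM: the register measurement is replaced by sampling from the ideal probability, the action is assumed to have the form $e^{-i S_z \alpha} e^{-i S_x \beta}$, and all names as well as the values $\epsilon = 0.9$ and $\Delta = 4\pi$ are illustrative assumptions.

```python
import cmath
import math
import random


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def action_unitary(alpha, beta):
    """exp(-i S_z alpha) exp(-i S_x beta) with S_k = sigma_k / 2 (assumed form)."""
    c, s = math.cos(beta / 2), math.sin(beta / 2)
    em, ep = cmath.exp(-0.5j * alpha), cmath.exp(0.5j * alpha)
    return [[em * c, -1j * em * s], [-1j * ep * s, ep * c]]


def agent_fidelity(u, env):
    """|<E|U|0>|^2, i.e. the ideal probability that the register reads 0."""
    overlap = env[0].conjugate() * u[0][0] + env[1].conjugate() * u[1][0]
    return min(1.0, abs(overlap) ** 2)


def run_protocol(theta, phi, n_iter, eps, delta0=4 * math.pi, seed=0):
    """Simulate n_iter iterations; return (accumulated unitary, delta, history)."""
    rng = random.Random(seed)
    env = [complex(math.cos(theta / 2)),
           cmath.exp(1j * phi) * math.sin(theta / 2)]
    u = [[1, 0], [0, 1]]                       # accumulated rotation, identity at k = 1
    delta = delta0
    history = []
    for k in range(1, n_iter + 1):
        p0 = agent_fidelity(u, env)
        m = 0 if rng.random() < p0 else 1      # simulated register outcome
        if m == 1:                             # outcome 1: act on the agent
            alpha = rng.uniform(-0.5, 0.5) * delta
            beta = rng.uniform(-0.5, 0.5) * delta
            u = matmul(action_unitary(alpha, beta), u)
        delta *= eps if m == 0 else 1.0 / eps  # reward function
        history.append((k, m, delta, agent_fidelity(u, env)))
    return u, delta, history


u, delta, history = run_protocol(theta=2.0, phi=1.0, n_iter=200, eps=0.9, seed=3)
print(f"final delta = {delta:.3f}, final fidelity = {history[-1][3]:.4f}")
```

Each run is a single stochastic realization, so different seeds give different trajectories of the exploration range and fidelity.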
III Experimental Results of Quantum Reinforcement Learning with the Rigetti Cloud Quantum Computer
In this section we describe the experimental results obtained with the Rigetti cloud quantum computer for the measurement-based adaptation protocol of Ref. [1].
The algorithm has been tested for six different initial states of the environment. These states are the following,
(20) 
(21) 
(22) 
(23) 
(24) 
(25) 
Hereunder, we plot the exploration range, $\Delta$, and the fidelity, which we define in Eq. (26), in terms of the total number of iterations. The results obtained using the 8Q-Agave chip of Rigetti are shown together with the corresponding ideal simulation on the Rigetti quantum virtual machine, which includes no noise. We point out that each of the experimental plots in this article contains a single realization of the experiment, given that, in the adaptation protocol, single instances are what is relevant, rather than averages. This is because, in each experiment, the successive measurements influence the subsequent ones and are influenced by the previous ones, in order for the agent to adapt to the environment. As a consequence, the theory curve and the experimental curve for each example match qualitatively but not fully quantitatively, as both are probabilistic. On the other hand, when the exploration range converges to zero we always observe convergence to a large final fidelity. The blue solid line represents the real experiment result, while the red dash-dotted one corresponds to the ideal simulation. It is also worth mentioning that, after doing many experiments with the ideal simulator and the real device for different values of the parameter $\epsilon$, we fixed it to the value that we found to yield a balanced exploration-exploitation ratio. The exploration-exploitation balance is a feature of reinforcement learning, where the optimal strategy must explore enough new possibilities while at the same time exploiting the ones that work best [28].
To begin with, let us have a look at Fig. 1, which corresponds to one of the environment states listed above. It shows the exploration range, $\Delta$, and the fidelity as defined in Eq. (26) in terms of the number of iterations. As we can see, 140 iterations are enough for $\Delta$ to take a value of almost zero. It is a good example of a balanced exploration versus exploitation ratio, that is, the exploration range decreases while making continuous peaks. Each of them represents an exploration stage where $\Delta$ increases at the same time that the fidelity changes significantly. These changes might bring a positive result, such that the agent receives a reward, which means that $\Delta$ decreases, or a punishment, in which case it keeps increasing. Thus, the fidelity is not constant, and it does not attain a stable value of around 95% until 100 iterations have been completed. In this case the real and ideal experiments yield a similar result. It is true that the ideal $\Delta$ decreases more smoothly and faster than the real one. However, the values of the fidelity after 130 iterations are practically the same.
In our calculations we employ a classical measure of the fidelity. The reason to use the classical fidelity, and not the quantum version, is the reduction of the resources needed in the experiments. We cannot perform full quantum state tomography, as it is exponentially hard. Therefore, the definition used in the algorithm is,
(26) $F = \sqrt{p_0\, p_0^T} + \sqrt{p_1\, p_1^T}$
for one-qubit measurements. Here, $p_0$ and $p_1$ stand for the probability of obtaining 0 or 1 as an outcome when measuring the real qubit, and $p_0^T$ and $p_1^T$ are the same probabilities for the corresponding theoretical qubit state that we expect. This fidelity coincides at lowest order with the fully quantum one, illustrating the convergence of the protocol for a large number of iterations.
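As an illustration of how such a classical fidelity could be evaluated from measurement records, the sketch below estimates the outcome probabilities from a list of single-shot results and compares them with the expected ones. It assumes the fidelity takes the form $\sqrt{p_0 p_0^T} + \sqrt{p_1 p_1^T}$; the function names and the synthetic record of 8192 shots are hypothetical and do not reproduce any experimental data.

```python
import math
from collections import Counter


def probabilities_from_shots(outcomes):
    """Estimate (p0, p1) from a sequence of single-shot results (each 0 or 1)."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return counts[0] / total, counts[1] / total


def classical_fidelity(measured, expected):
    """Sum over outcomes of sqrt(p_measured * p_expected) (assumed definition)."""
    return sum(math.sqrt(p * q) for p, q in zip(measured, expected))


# Synthetic record: 6144 zeros and 2048 ones, i.e. 8192 shots in total.
shots = [0] * 6144 + [1] * 2048
measured = probabilities_from_shots(shots)
fidelity = classical_fidelity(measured, (0.75, 0.25))
```

Because only outcome probabilities enter, this quantity is insensitive to the relative phase of the qubit, which is the price paid for avoiding tomography.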
Let us continue with the discussion, focussing on Fig. 2. In this experiment, the algorithm has to take the agent from the state $|0\rangle$ to an environment state that is closer to $|1\rangle$ than to $|0\rangle$ (0.6 is the probability of getting $|1\rangle$ as an outcome when measuring the environment). Bearing this in mind, it seems reasonable that 70 iterations are not enough for $\Delta$ to reach a value below 2 in the real case. Apart from this, in spite of achieving a fidelity above 99% in less than 20 iterations, the exploration still continues. As a consequence, the agent moves from its current state to one further from the environment state.
In general, we notice a clear relationship between how smooth the $\Delta$ line is and how constant the fidelity remains. Indeed, the exploration range decreases smoothly from just under 20 iterations to just under 40. In this range the fidelity does not change, because the agent is being rewarded. The price to pay for not exploring at such early stages of the learning is that $\Delta$ converges after a larger number of iterations than in other experiments. After 140 iterations we see that the convergence of $\Delta$ is not guaranteed: in the real experiment it has a value above 1. Regarding the ideal simulation result, we conclude that less than 20 iterations can be enough to converge to the environment state with fidelity larger than 99.9% and, what is more, to remain in the same state until the exploration range has converged to zero [see inset in Fig. 2].
The third environment state yields the results presented in Figs. 3 and 4. Unlike the previous examples, we do not compare the ideal theory and the real experiments, which show similarly good agreement. Instead, we contrast two different real experiment outcomes. In this way, we can show how, even for the same experiment, i.e., the same initial state of the environment, the algorithm can exhibit different behaviours. In both cases $\Delta$ goes to zero and the fidelity reaches a constant value, above 94% in the first case and above 99% in the second one. However, this convergence is achieved in two different ways. On the one hand, we observe that exploitation predominates over exploration (see Fig. 4), except for several spots where the algorithm keeps exploring new options. Then, as the initial fidelity is larger than 90%, the state of the agent converges to the environment in less than 70 iterations. On the other hand, when exploration is more important (as shown in Fig. 3), the fidelity is erratic, changing from low to high values. Moreover, it takes longer for $\Delta$ to converge and for the fidelity to stabilize: more than 80 iterations.
Let us focus just on the first stages of the learning process, for less than 40 iterations. Comparing both experiments, we see that in the first case the algorithm starts exploring from the very beginning; thus, in less than 20 iterations the fidelity takes a value above 99.6%. In the other case, by contrast, there is more exploitation at the beginning, and around 25 iterations are required to reach a fidelity of 99%.
Among the six states that we have chosen, this is the one in which agent and environment are the closest. Nevertheless, 70 iterations are not enough to reach a value of $\Delta$ below 1. We can therefore state that a smaller distance on the Bloch sphere between agent and environment does not in general imply faster learning.
Let us now analyse Fig. 5. This state is, along with the previous one, the most asymmetric case. However, unlike the previous experiment, this one begins with the lowest value of the fidelity. The environment state is the farthest from the initial agent state $|0\rangle$, with a probability of just 0.25 of obtaining this outcome (zero) when measuring the environment. Therefore, as might be expected, the algorithm is still exploring after 70 iterations rather than exploiting the knowledge it has already acquired from the environment. We also find that less than 100 iterations can be enough to reach a value of $\Delta$ below 1. At that point, the agent has already converged to a state with fidelity larger than 97%. It is also remarkable that, in this case, the fidelity attains a value larger than 99% in less than 10 iterations. However, as $\Delta$ had not yet converged, the fidelity later leaves this value as the agent explores again. Once again, as a general rule, we can see that the algorithm is exploring for all the iterations. Exploring is synonymous with a changing fidelity, while whenever $\Delta$ decreases smoothly the fidelity remains constant. With this result, we wanted to show how the real experiment sometimes converged faster to the environment state, e.g., in just 9 iterations. On top of that, the exploration range also went to zero faster than in the ideal experiment. Nevertheless, the value of the fidelity once $\Delta$ has converged is exactly the same in both cases.
We now analyze the most symmetric cases, starting with an environment prepared in a uniform superposition with a relative phase between both states. The experiment chosen to highlight here (see Fig. 6) is the one in which the fidelity reaches a constant value above 99.9% in less than 40 iterations. The corresponding $\Delta$ evolves with a good balance between exploration and exploitation until reaching a point where it does not explore anymore and decreases very smoothly. Comparing it to the ideal case, we notice two opposing behaviours: the ideal case makes a larger exploration at the beginning, which yields a larger constant value of the fidelity in less than 20 iterations, whereas the real system needs almost 40 to get to this point. On the other hand, in the real experiment there is a longer learning stage, from zero to 40 iterations. Thus, in this particular case the exploration range diminishes faster in the real experiment for a large number of iterations. This happens because of the larger value of $\Delta$ attained in the ideal experiment.
Finally, in the second symmetric case, $(|0\rangle + |1\rangle)/\sqrt{2}$, there are no relative phases between the states of the computational basis. This experiment is very rich in phenomena and exemplifies very well how the algorithm works. Figure 7 clearly shows an initial exploration stage in which the fidelity grows quickly, reaching values above 99.9% in just nine iterations. At the same time the exploration range, $\Delta$, grows, making two consecutive peaks, and then decreases smoothly while the fidelity remains constant. Subsequently, there are a couple of exploration peaks that make the fidelity oscillate. Then, after a few iterations in which the fidelity decreases smoothly, we come to a third and most important exploration phase, where we observe that the fidelity has an increasing tendency: it suffices to check that the subsequent minima of the fidelity take larger and larger values. The long-term reward of such an amount of exploration is a fidelity of 99.99% after less than 70 iterations. However, the exploration range is still large and leaves room for trying new states, which make the fidelity drop again. Finally, the algorithm is able to find a good exploration-exploitation balance, which makes the fidelity increase and remain constant with values above 99.5%. On top of that, the exploration range goes progressively to zero. The ideal experiment is an excellent example of how fast the algorithm can reach a fidelity above 99.9% while also guaranteeing the convergence of $\Delta$. In this way, once the exploration range has become so small, it is assured that the agent does not go to another state. In other words, the agent has definitely converged to the environment with a large fidelity.
In Table 1 we sort some results of the experiments run on the Rigetti 8Q-Agave cloud quantum computer from larger to smaller value of the fidelity. As we can see, the fidelity is close to 100% in most of the experiments when the exploration range approaches zero. Thus, we are succeeding in adapting the agent to the environment state. From these data we can also draw the conclusion that the convergence of the agent to the environment state is guaranteed whenever $\Delta \to 0$. We did not find any case where the exploration range is close to zero and the fidelity is below 90%.
$\Delta$    0.36   0.24   0.18   0.03   0.05   0.24   0.16
$F$ (%)     99.89  99.72  99.53  99.20  97.72  97.53  94.72
Initial environment state 
IV Conclusions
In this article, we have implemented the measurement-based adaptation protocol with quantum reinforcement learning of Ref. [1]. Consistently with it, we have checked that the fidelity indeed increases, reaching values over 90% in less than about thirty iterations in real experiments. We did not observe any case where $\Delta \to 0$ and $F < 90\%$. Thus, to a large extent, the protocol succeeds in making the agent converge to the environment. If we wanted to apply this algorithm as a subroutine, it would be possible to track the evolution of the exploration range and deduce from it the convergence of the agent to the environment. This is because the behaviour of $\Delta$ has proven to be closely related to the fidelity performance.
We can conclude that there is still a long way to go until the second quantum revolution gives rise to well-established techniques of quantum machine learning in the lab. However, this work is encouraging, since it sows the seeds for turning quantum reinforcement learning into a reality, and for the future implementation of semi-autonomous quantum agents.
We point out that another implementation of Ref. [1] in a different platform, namely quantum photonics, has recently been achieved in parallel to this work [32].
Acknowledgements
The authors acknowledge the use of Rigetti Forest for this work. The views expressed are those of the authors and do not reflect the official policy or position of Rigetti, or the Rigetti team. We thank I. Egusquiza for useful discussions. L. L. acknowledges support from Ramón y Cajal Grant RYC-2012-11391, and J. C. from Juan de la Cierva grant IJCI-2016-29681. We also acknowledge funding from MINECO/FEDER FIS2015-69983-P and Basque Government IT986-16. This material is also based upon work supported by the projects OpenSuperQ and QMiCS of the EU Flagship on Quantum Technologies, and by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) quantum algorithm teams program, under field work proposal number ERKJ333.
References
(1) F. Albarrán-Arriagada, J. C. Retamal, E. Solano, and L. Lamata, Measurement-based adaptation protocol with quantum reinforcement learning. Phys. Rev. A 98, 042315 (2018).
(2) Aram W. Harrow, Avinatan Hassidim, and Seth Lloyd, Quantum algorithm for linear systems of equations. Phys. Rev. Lett. 103, 150502 (2009).
(3) Nathan Wiebe, Daniel Braun, and Seth Lloyd, Quantum algorithm for data fitting. Phys. Rev. Lett. 109, 050505 (2012).
(4) Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost, Quantum algorithms for supervised and unsupervised machine learning. arXiv:1307.0411 (2013).
(5) Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd, Quantum support vector machine for big data classification. Phys. Rev. Lett. 113, 130503 (2014).
(6) Nathan Wiebe, Ashish Kapoor, and Krysta M. Svore, Quantum algorithms for nearest-neighbor methods for supervised and unsupervised learning. Quantum Info. Comput. 15, 316 (2015).
(7) Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost, Quantum principal component analysis. Nat. Phys. 10, 631 (2014).
(8) H. K. Lau, R. Pooser, G. Siopsis, and C. Weedbrook, Quantum machine learning over infinite dimensions. Phys. Rev. Lett. 118, 080501 (2017).
(9) M. Schuld, I. Sinayskiy, and F. Petruccione, An introduction to quantum machine learning. Contemporary Physics 56, 172 (2015).
(10) J. Adcock, E. Allen, M. Day, S. Frick, J. Hinchliff, M. Johnson, S. Morley-Short, S. Pallister, A. Price, and S. Stanisic, Advances in quantum machine learning. arXiv:1512.02900 (2015).
(11) V. Dunjko, J. M. Taylor, and H. J. Briegel, Quantum-enhanced machine learning. Phys. Rev. Lett. 117, 130501 (2016).
(12) V. Dunjko and H. J. Briegel, Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Rep. Prog. Phys. 81, 074001 (2018).
(13) J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Quantum machine learning. Nature 549, 195 (2017).
(14) R. Biswas, Z. Jiang, K. Kechezhi, S. Knysh, S. Mandrà, B. O’Gorman, A. Perdomo-Ortiz, A. Petukhov, J. Realpe-Gómez, E. Rieffel, D. Venturelli, F. Vasko, and Z. Wang, A NASA perspective on quantum computing: Opportunities and challenges. Parallel Computing 64, 81 (2017).
(15) A. Perdomo-Ortiz, M. Benedetti, J. Realpe-Gómez, and R. Biswas, Opportunities and challenges for quantum-assisted machine learning in near-term quantum computers. Quantum Sci. Technol. 3, 030502 (2018).
(16) A. Perdomo-Ortiz, A. Feldman, A. Ozaeta, S. V. Isakov, Z. Zhu, B. O’Gorman, H. G. Katzgraber, A. Diedrich, H. Neven, J. de Kleer, et al., On the readiness of quantum optimization machines for industrial applications. arXiv:1708.09780 (2017).
(17) M. Sasaki and A. Carlini, Quantum learning and universal quantum matching machine. Phys. Rev. A 66, 022303 (2002).
(18) M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, Quantum-Assisted Learning of Hardware-Embedded Probabilistic Graphical Models. Phys. Rev. X 7, 041052 (2017).
(19) M. Benedetti, J. Realpe-Gómez, and A. Perdomo-Ortiz, Quantum-assisted Helmholtz machines: a quantum-classical deep learning framework for industrial datasets in near-term devices. Quantum Sci. Technol. 3, 034007 (2018).
(20) E. Aïmeur, G. Brassard, and S. Gambs, Quantum speed-up for unsupervised learning. Machine Learning 90, 261 (2013).
(21) Alexey A. Melnikov, Hendrik Poulsen Nautrup, Mario Krenn, Vedran Dunjko, Markus Tiersch, Anton Zeilinger, and Hans J. Briegel, Active learning machine learns to create new quantum experiments. PNAS 115, 1221 (2018).
(22) U. Alvarez-Rodriguez, L. Lamata, P. Escandell-Montero, J. D. Martín-Guerrero, and E. Solano, Supervised Quantum Learning without Measurements. Scientific Reports 7, 13645 (2017).
(23) G. D. Paparo, V. Dunjko, A. Makmal, M. A. Martin-Delgado, and H. J. Briegel, Quantum Speedup for Active Learning Agents. Phys. Rev. X 4, 031002 (2014).
(24) L. Lamata, Basic protocols in quantum reinforcement learning with superconducting circuits. Scientific Reports 7, 1609 (2017).
(25) F. A. Cárdenas-López, L. Lamata, J. C. Retamal, and E. Solano, Multiqubit and multilevel quantum reinforcement learning with quantum technologies. PLOS ONE 13(7): e0200455 (2018).
(26) D. Dong, C. Chen, H. Li, and T. J. Tarn, Quantum reinforcement learning. IEEE Transactions on Syst. Man, Cybern. Part B (Cybernetics) 38, 1207–1220 (2008).
(27) D. Crawford, A. Levit, N. Ghadermarzy, J. S. Oberoi, and P. Ronagh, Reinforcement learning using quantum Boltzmann machines. Quant. Inf. Comput. 18, 51 (2018).
(28) S. Russell and P. Norvig, Artificial Intelligence: A modern approach (Prentice Hall, 1995).
(29) https://www.dwavesys.com/home
(30) https://www.research.ibm.com/ibmq/
(31) https://www.rigetti.com
(32) Shang Yu, F. Albarrán-Arriagada, J. C. Retamal, Yi-Tao Wang, Wei Liu, Zhi-Jin Ke, Yu Meng, Zhi-Peng Li, Jian-Shun Tang, E. Solano, L. Lamata, Chuan-Feng Li, and Guang-Can Guo, Reconstruction of a Photonic Qubit State with Quantum Reinforcement Learning. arXiv:1808.09241.