pyqlearning package

Submodules

pyqlearning.annealing_model module

class pyqlearning.annealing_model.AnnealingModel[source]

Bases: object

Abstract class of Annealing.

There are many hyperparameters that we have to set before the actual searching and learning process begins. Each parameter should be decided in relation to Deep/Reinforcement Learning theory, and each choice has side effects on the training model. Because of this complexity, so-called hyperparameter tuning becomes a burden for data scientists and R & D engineers, not only from a theoretical point of view but also at the implementation level.

This issue can be considered a combinatorial optimization problem, which is an optimization problem in which an optimal solution has to be identified from a finite set of solutions. The solutions are normally discrete or can be converted into discrete form. This is an important topic studied in operations research and in fields such as software engineering, artificial intelligence (AI), and machine learning. For instance, the travelling salesman problem is one of the most popular combinatorial optimization problems.

In this problem setting, this library provides an Annealing Model to search for the optimal combination of hyperparameters. For instance, Simulated Annealing is a probabilistic, single-solution-based search method inspired by the annealing process in metallurgy. Annealing is a physical process of tempering certain alloys of metal, glass, or crystal by heating above the melting point, holding that temperature, and then cooling very slowly until the material solidifies into a perfect crystalline structure. The simulation of this process is known as simulated annealing.
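
For intuition, the following is a minimal sketch of the Metropolis acceptance step and cooling loop that simulated annealing rests on; the cost function, neighbour function, and cooling schedule are illustrative assumptions, not the concrete implementation of this class or its subclasses.

    import numpy as np

    def metropolis_accept(current_cost, candidate_cost, temperature):
        # Always accept improvements; accept worse candidates with
        # probability exp(-(cost increase) / T), which shrinks as T is lowered.
        if candidate_cost <= current_cost:
            return True
        return np.random.rand() < np.exp((current_cost - candidate_cost) / temperature)

    def anneal(cost, x, neighbor, t=10.0, t_min=1e-3, cooling_rate=0.95):
        # `cost` maps a solution to a scalar; `neighbor` proposes a nearby solution.
        while t > t_min:
            candidate = neighbor(x)
            if metropolis_accept(cost(x), cost(candidate), t):
                x = candidate
            t *= cooling_rate
        return x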

References

  • Bektas, T. (2006). The multiple traveling salesman problem: an overview of formulations and solution procedures. Omega, 34(3), 209-219.
  • Bertsimas, D., & Tsitsiklis, J. (1993). Simulated annealing. Statistical science, 8(1), 10-15.
  • Das, A., & Chakrabarti, B. K. (Eds.). (2005). Quantum annealing and related optimization methods (Vol. 679). Springer Science & Business Media.
  • Du, K. L., & Swamy, M. N. S. (2016). Search and optimization by metaheuristics. New York City: Springer.
  • Edwards, S. F., & Anderson, P. W. (1975). Theory of spin glasses. Journal of Physics F: Metal Physics, 5(5), 965.
  • Facchi, P., & Pascazio, S. (2008). Quantum Zeno dynamics: mathematical and physical aspects. Journal of Physics A: Mathematical and Theoretical, 41(49), 493001.
  • Heim, B., Rønnow, T. F., Isakov, S. V., & Troyer, M. (2015). Quantum versus classical annealing of Ising spin glasses. Science, 348(6231), 215-217.
  • Heisenberg, W. (1925). Über quantentheoretische Umdeutung kinematischer und mechanischer Beziehungen. Z. Phys. 33, pp. 879-893.
  • Heisenberg, W. (1927). Über den anschaulichen Inhalt der quantentheoretischen Kinematik und Mechanik. Zeitschrift fur Physik, 43, 172-198.
  • Heisenberg, W. (1984). The development of quantum mechanics. In Scientific Review Papers, Talks, and Books - Wissenschaftliche Übersichtsartikel, Vorträge und Bücher (pp. 226-237). Springer Berlin Heidelberg.
  • Hilgevoord, Jan and Uffink, Jos, “The Uncertainty Principle”, The Stanford Encyclopedia of Philosophy (Winter 2016 Edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/win2016/entries/qt-uncertainty/>.
  • Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Physical Review Letters, 78(14), 2690.
  • Messiah, A. (1966). Quantum mechanics. 2 (1966). North-Holland Publishing Company.
  • Mezard, M., & Montanari, A. (2009). Information, physics, and computation. Oxford University Press.
  • Nallusamy, R., Duraiswamy, K., Dhanalaksmi, R., & Parthiban, P. (2009). Optimization of non-linear multiple traveling salesman problem using k-means clustering, shrink wrap algorithm and meta-heuristics. International Journal of Nonlinear Science, 8(4), 480-487.
  • Schrödinger, E. (1926). Quantisierung als eigenwertproblem. Annalen der physik, 385(13), S.437-490.
  • Somma, R. D., Batista, C. D., & Ortiz, G. (2007). Quantum approach to classical statistical mechanics. Physical review letters, 99(3), 030603.
  • Suzuki, S. (2008). Combinatorial optimization problems and quantum annealing: Theory and performance evaluation of quantum adiabatic evolution. Bussei Kenkyu, 90(4), pp. 598-676 (cited pages: pp. 619-624). (In Japanese.)
  • Nishimori, H., & Ohzeki, M. (2018). Fundamentals of Quantum Annealing. Supervised by S. Sudo and M. Oka, Kyoritsu Shuppan (cited pages: pp. 9-46). (In Japanese.)
accepted_pos

getter

annealing()[source]

Annealing.

computed_cost_arr

getter

current_cost_arr

getter

current_dist_arr

getter

fit_dist_mat(dist_mat_arr)[source]

Fit observed data points.

Parameters:dist_mat_arr – Observed data points (a distance matrix) to be fitted.
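
A hedged usage sketch: MyAnnealingModel below stands in for a hypothetical concrete subclass of AnnealingModel that implements a cost function; the matrix shape and the random data are illustrative assumptions.

    import numpy as np

    # Hypothetical concrete subclass of AnnealingModel (not part of this module).
    annealing_model = MyAnnealingModel()

    # A symmetric 10x10 distance matrix, e.g. for a small travelling salesman instance.
    dist_mat_arr = np.random.uniform(size=(10, 10))
    dist_mat_arr = (dist_mat_arr + dist_mat_arr.T) / 2

    annealing_model.fit_dist_mat(dist_mat_arr)   # fit observed data points
    annealing_model.annealing()                  # run the annealing search
    print(annealing_model.predicted_log_arr)     # inspect the search log via the getter
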
get_accepted_pos()[source]

getter

get_computed_cost_arr()[source]

getter

get_current_cost_arr()[source]

getter

get_current_dist_arr()[source]

getter

get_predicted_log_arr()[source]

getter

get_predicted_log_list()[source]

getter

get_stocked_predicted_arr()[source]

getter

get_var_arr()[source]

getter

get_var_log_arr()[source]

getter

get_x()[source]

getter

predicted_log_arr

getter

predicted_log_list

getter

set_accepted_pos(value)[source]

setter

set_computed_cost_arr(value)[source]

setter

set_current_cost_arr(value)[source]

setter

set_current_dist_arr(value)[source]

setter

set_predicted_log_arr(value)[source]

setter

set_predicted_log_list(value)[source]

setter

set_stocked_predicted_arr(value)[source]

setter

set_var_arr(value)[source]

setter

set_var_log_arr(value)[source]

setter

set_x(value)[source]

setter

stocked_predicted_arr

getter

var_arr

getter

var_log_arr

getter

x

getter

pyqlearning.deep_q_learning module

class pyqlearning.deep_q_learning.DeepQLearning(function_approximator)[source]

Bases: object

Abstract base class to implement the Deep Q-Learning.

The Reinforcement Learning theory presents several issues from the perspective of deep learning theory (Mnih, V., et al., 2013). Firstly, deep learning applications have required large amounts of hand-labelled training data. Reinforcement Learning algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy, and delayed.

The difference between the two theories is not only the type of data but also the timing to be observed. The delay between taking actions and receiving rewards, which can be thousands of timesteps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning.

Another issue is that deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in Reinforcement learning, the data distribution changes as the algorithm learns new behaviours, presenting aspects of recursive learning, which can be problematic for deep learning methods that assume a fixed underlying distribution.

Increasing the complexity of states/actions is equivalent to increasing the number of combinations of states/actions. If the value function is continuous and granularities of states/actions are extremely fine, the combinatorial explosion will be encountered. In other words, this basic approach is totally impractical, because the state/action-value function is estimated separately for each sequence, without any generalisation. Instead, it is common to use a function approximator to estimate the state/action-value function.

Considering the many variable parts and functional extensions in the Deep Q-Learning paradigm from the perspective of commonality/variability analysis, and in order to practice object-oriented design, this abstract class defines the skeleton of a Deep Q-Learning algorithm in an operation, deferring some steps of concrete variant algorithms such as the Epsilon Deep Q-Network to client subclasses. This abstract class thus lets subclasses redefine certain steps of a Deep Q-Learning algorithm without changing the algorithm’s structure.
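
As an illustration of that skeleton, the sketch below redefines some of the overridable steps documented under this class for a toy one-dimensional environment; the state and reward encodings are assumptions made for this example and are not part of the library.

    import numpy as np
    from pyqlearning.deep_q_learning import DeepQLearning

    class ToyDeepQLearning(DeepQLearning):
        # Hypothetical minimal subclass: a 1-D walk whose goal is position 10.

        def extract_possible_actions(self, state_arr):
            # Candidate actions: step left or step right.
            return np.array([state_arr - 1, state_arr + 1])

        def observe_reward_value(self, state_arr, action_arr):
            # Reward approaching the goal position.
            return -float(np.abs(10 - action_arr).sum())

        def select_action(self, next_action_arr, next_q_arr):
            # Greedy selection; an Epsilon-Greedy variant would branch here.
            key = next_q_arr.argmax()
            return next_action_arr[key], next_q_arr[key]

        def update_state(self, state_arr, action_arr):
            # In this toy setting the chosen action array becomes the next state.
            return action_arr

    # Usage would wire in a FunctionApproximator implementation (see the next module):
    # agent = ToyDeepQLearning(function_approximator)
    # agent.learn(state_arr=np.array([[0.0]]), limit=1000)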

References

  • Egorov, M. (2016). Multi-agent deep reinforcement learning. (URL: https://pdfs.semanticscholar.org/dd98/9d94613f439c05725bad958929357e365084.pdf)
  • Gupta, J. K., Egorov, M., & Kochenderfer, M. (2017, May). Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (pp. 66-83). Springer, Cham.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
alpha_value

getter Learning rate.

check_the_end_flag(state_arr)[source]

Check the end flag.

If the return value is True, the learning is finished.

By default, the learning is not stopped by this flag. This method should be overridden for concrete use cases.

Parameters:state_arr – np.ndarray of state in self.t.
Returns:bool
extract_possible_actions(state_arr)[source]

Extract possible actions.

Parameters:state_arr – np.ndarray of state.
Returns:np.ndarray of actions.
function_approximator

getter

gamma_value

getter Gamma value.

get_alpha_value()[source]

getter Learning rate.

get_function_approximator()[source]

getter

get_gamma_value()[source]

getter Gamma value.

get_q_logs_arr()[source]

getter

learn(state_arr, limit=1000)[source]

Learn and search for the optimal solution.

Parameters:
  • state_arr – np.ndarray of initial state.
  • limit – The maximum number of iterative updates based on value iteration algorithms.
learn_q(predicted_q_arr, real_q_arr)[source]

Learn Q with the function approximator.

Parameters:
  • predicted_q_arr – np.ndarray of predicted Q-Values.
  • real_q_arr – np.ndarray of real Q-Values.
observe_reward_value(state_arr, action_arr)[source]

Compute the reward value.

Parameters:
  • state_arr – np.ndarray of state.
  • action_arr – np.ndarray of action.
Returns:

Reward value.

q_logs_arr

getter

select_action(next_action_arr, next_q_arr)[source]

Select action by Q(state, action).

Parameters:
  • next_action_arr – np.ndarray of actions.
  • next_q_arr – np.ndarray of Q-Values.
Returns:

Tuple(np.ndarray of action, Q-Value)

set_alpha_value(value)[source]

setter Learning rate.

set_function_approximator(value)[source]

set_gamma_value(value)[source]

setter Gamma value.

set_q_logs_arr(values)[source]

setter

update_q(predicted_q_arr, reward_value_arr, next_max_q_arr)[source]

Update Q.

Parameters:
  • predicted_q_arr – np.ndarray of predicted Q-Values.
  • reward_value_arr – np.ndarray of reward values.
  • next_max_q_arr – np.ndarray of maximum Q-Values in next time step.
Returns:

np.ndarray of real Q-Values.
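
For reference, the sketch below shows the standard one-step TD target that such an update typically computes, assuming alpha_value (learning rate) and gamma_value (discount factor) play their usual roles; it is a hedged illustration, not necessarily the literal body of this method.

    def build_real_q_arr(predicted_q_arr, reward_value_arr, next_max_q_arr,
                         alpha_value=0.9, gamma_value=0.9):
        # Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q), element-wise on np.ndarrays.
        return predicted_q_arr + alpha_value * (
            reward_value_arr + gamma_value * next_max_q_arr - predicted_q_arr
        )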

update_state(state_arr, action_arr)[source]

Update state.

This method can be overridden for concrete use cases.

Parameters:
  • state_arr – np.ndarray of state in self.t.
  • action_arr – np.ndarray of action in self.t.
Returns:

np.ndarray of state in self.t+1.

pyqlearning.function_approximator module

class pyqlearning.function_approximator.FunctionApproximator[source]

Bases: object

The interface of Function Approximators.

Typically, Deep Q-Learning algorithms such as the Deep Q-Network use Convolutional Neural Networks (CNNs) as a function approximator to solve the problem setting of the so-called combinatorial explosion.

But it is not inevitable to reuse a CNN as the function approximator. In the above problem setting of generalisation and combinatorial explosion, for instance, Long Short-Term Memory (LSTM) networks, which are a special Recurrent Neural Network (RNN) structure, and CNNs are functionally equivalent as function approximators. In the same problem setting, functional equivalents can be functionally replaced. Considering that the feature space of the rewards has a time-series nature, an LSTM can be more useful.

This interface defines methods to control functional equivalents of a CNN. DeepQLearning delegates to an object that is-a this interface. In more detail, this interface defines a family of Deep Learning algorithms, such as LSTM, Convolutional LSTM (Xingjian, S. H. I. et al., 2015), and the CLDNN architecture (Sainath, T. N., et al., 2015), encapsulates each one, and makes them interchangeable. This Strategy pattern lets the function approximation algorithm vary independently from the clients that use it: capture the abstraction in an interface, bury implementation details in derived classes.
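
The sketch below is a deliberately trivial implementer of this interface, a linear model in plain NumPy rather than a CNN or LSTM; its attribute layout and method bodies are assumptions made for illustration only.

    import numpy as np
    from pyqlearning.function_approximator import FunctionApproximator

    class LinearApproximator(FunctionApproximator):
        # Hypothetical implementer: a linear map from flattened action features to Q-Values.

        def __init__(self, dim, learning_rate=1e-3):
            self.model = np.zeros(dim)            # the model object exposed by this interface
            self.__learning_rate = learning_rate
            self.__feature_arr = None

        def inference_q(self, next_action_arr):
            # Flatten each candidate action and map it to a scalar Q-Value.
            self.__feature_arr = next_action_arr.reshape(next_action_arr.shape[0], -1)
            return self.__feature_arr @ self.model

        def learn_q(self, predicted_q_arr, real_q_arr):
            # One gradient step on the squared TD error between predicted and real Q-Values.
            grad = (predicted_q_arr - real_q_arr).reshape(-1, 1)
            self.model -= self.__learning_rate * (self.__feature_arr * grad).mean(axis=0)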

References

  • https://code.accel-brain.com/Deep-Learning-by-means-of-Design-Pattern/README.html
  • https://code.accel-brain.com/Reinforcement-Learning/README.html#deep-q-network
  • Egorov, M. (2016). Multi-agent deep reinforcement learning. (URL: https://pdfs.semanticscholar.org/dd98/9d94613f439c05725bad958929357e365084.pdf)
  • Gupta, J. K., Egorov, M., & Kochenderfer, M. (2017, May). Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (pp. 66-83). Springer, Cham.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • Sainath, T. N., Vinyals, O., Senior, A., & Sak, H. (2015, April). Convolutional, long short-term memory, fully connected deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on (pp. 4580-4584). IEEE.
  • Xingjian, S. H. I., Chen, Z., Wang, H., Yeung, D. Y., Wong, W. K., & Woo, W. C. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems (pp. 802-810).
inference_q(next_action_arr)[source]

Infer Q-Values.

Parameters:next_action_arr – np.ndarray of action.
Returns:np.ndarray of Q-Values.
learn_q(predicted_q_arr, real_q_arr)[source]

Learn Q-Values with the function approximator.

Parameters:
  • predicted_q_arr – np.ndarray of predicted Q-Values.
  • real_q_arr – np.ndarray of real Q-Values.
model

object of model as a function approximator.

pyqlearning.q_learning module

class pyqlearning.q_learning.QLearning[source]

Bases: object

Abstract base class and Template Method Pattern of Q-Learning.

According to the Reinforcement Learning problem settings, Q-Learning is a kind of Temporal Difference Learning (TD Learning) that can be considered a hybrid of the Monte Carlo method and Dynamic Programming. Like the Monte Carlo method, a TD Learning algorithm can learn from experience without a model of the environment. And like Dynamic Programming, this learning algorithm is a functional extension of the bootstrap method.

In this library, Q-Learning is divided into Epsilon Greedy Q-Learning and Boltzmann Q-Learning. These algorithms are functionally equivalent, but their structures should be conceptually distinguished.

Considering the many variable parts and functional extensions in the Q-Learning paradigm from the perspective of commonality/variability analysis, and in order to practice object-oriented design, this abstract class defines the skeleton of a Q-Learning algorithm in an operation, deferring some steps of concrete variant algorithms such as Epsilon Greedy Q-Learning and Boltzmann Q-Learning to client subclasses. This abstract class thus lets subclasses redefine certain steps of a Q-Learning algorithm without changing the algorithm’s structure.
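
A sketch of such a client subclass is shown below, implementing the abstract steps documented under this class for a toy one-dimensional grid; the environment encoding and the greedy selection rule are assumptions of the example (an Epsilon Greedy or Boltzmann variant would add exploration inside select_action).

    from pyqlearning.q_learning import QLearning

    class ToyGridQLearning(QLearning):
        # Hypothetical minimal subclass: a 1-D grid whose goal is state 10.

        def extract_possible_actions(self, state_key):
            # From any cell the agent may step left or right.
            return [state_key - 1, state_key + 1]

        def observe_reward_value(self, state_key, action_key):
            # Reward only the transition that reaches the goal cell.
            return 1.0 if action_key == 10 else 0.0

        def select_action(self, state_key, next_action_list):
            # Greedy choice over the Q-Values stored in self.q_df.
            q_list = [self.extract_q_df(state_key, action_key) for action_key in next_action_list]
            return next_action_list[q_list.index(max(q_list))]

    # Usage sketch:
    # agent = ToyGridQLearning()
    # agent.alpha_value = 0.9
    # agent.gamma_value = 0.9
    # agent.learn(state_key=0, limit=1000)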

References

  • Agrawal, S., & Goyal, N. (2011). Analysis of Thompson sampling for the multi-armed bandit problem. arXiv preprint arXiv:1111.1797.
  • Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721.
  • Chapelle, O., & Li, L. (2011). An empirical evaluation of thompson sampling. In Advances in neural information processing systems (pp. 2249-2257).
  • Du, K. L., & Swamy, M. N. S. (2016). Search and optimization by metaheuristics (p. 434). New York City: Springer.
  • Kaufmann, E., Cappe, O., & Garivier, A. (2012). On Bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics (pp. 592-600).
  • Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • Richard Sutton and Andrew Barto (1998). Reinforcement Learning. MIT Press.
  • Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dissertation, University of Cambridge).
  • Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8(3-4), 279-292.
  • White, J. (2012). Bandit algorithms for website optimization. O’Reilly Media, Inc.
alpha_value

getter Learning rate.

check_the_end_flag(state_key)[source]

Check the end flag.

If the return value is True, the learning is finished.

By default, the learning is not stopped by this flag. This method should be overridden for concrete use cases.

Parameters:state_key – The key of state in self.t.
Returns:bool
extract_possible_actions(state_key)[source]

Extract the list of possible actions in self.t+1.

Abstract method for concrete use cases.

Parameters:state_key – The key of state in self.t+1.
Returns:list of possible actions in self.t+1.
extract_q_df(state_key, action_key)[source]

Extract Q-Value from self.q_df.

Parameters:
  • state_key – The key of state.
  • action_key – The key of action.
Returns:

Q-Value.

extract_r_df(state_key, r_value, action_key=None)[source]

Insert or update R-Value in self.r_df.

Parameters:
  • state_key – The key of state.
  • r_value – R-Value(Reward).
  • action_key – The key of action, if it is necessary as a parameter of the value function.
Exceptions:
TypeError: If the type of r_value is not float.
gamma_value

getter Gamma value.

get_alpha_value()[source]

getter Learning rate.

get_gamma_value()[source]

getter Gamma value.

get_q_df()[source]

getter

get_r_df()[source]

getter

get_t()[source]

getter Time.

learn(state_key, limit=1000)[source]

Learn and search for the optimal solution.

Parameters:
  • state_key – Initial state.
  • limit – The maximum number of iterative updates based on value iteration algorithms.
normalize_q_value()[source]

Normalize Q-Values. This method should be overridden for concrete use cases.

This method is called in each learning step.

For example:
self.q_df.q_value = self.q_df.q_value / self.q_df.q_value.sum()
normalize_r_value()[source]

Normalize R-Values. This method should be overridden for concrete use cases.

This method is called in each learning step.

For example:
self.r_df.r_value = self.r_df.r_value / self.r_df.r_value.sum()
observe_reward_value(state_key, action_key)[source]

Compute the reward value.

Parameters:
  • state_key – The key of state.
  • action_key – The key of action.
Returns:

Reward value.

predict_next_action(state_key, next_action_list)[source]

Predict next action by Q-Learning.

Parameters:
  • state_key – The key of state in self.t+1.
  • next_action_list – The list of possible actions in self.t+1.
Returns:

The key of action.

q_df

getter

r_df

getter

save_q_df(state_key, action_key, q_value)[source]

Insert or update Q-Value in self.q_df.

Parameters:
  • state_key – State.
  • action_key – Action.
  • q_value – Q-Value.
Exceptions:
TypeError: If the type of q_value is not float.
save_r_df(state_key, r_value, action_key=None)[source]

Insert or update R-Value in self.r_df.

Parameters:
  • state_key – The key of state.
  • r_value – R-Value(Reward).
  • action_key – The key of action, if it is necessary as a parameter of the value function.
Exceptions:
TypeError: If the type of r_value is not float.
select_action(state_key, next_action_list)[source]

Select action by Q(state, action).

Abstract method for concrete use cases.

Parameters:
  • state_key – The key of state.
  • next_action_list – The list of possible actions in self.t+1. If the length of this list is zero, all actions are considered possible.
Returns:

The key of action.

set_alpha_value(value)[source]

setter Learning rate.

set_gamma_value(value)[source]

setter Gamma value.

set_q_df(value)[source]

setter

set_r_df(value)[source]

setter

set_t(value)[source]

setter Time.

t

getter Time.

update_q(state_key, action_key, reward_value, next_max_q)[source]

Update Q-Value.

Parameters:
  • state_key – The key of state.
  • action_key – The key of action.
  • reward_value – R-Value(Reward).
  • next_max_q – Maximum Q-Value.
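
For reference, these arguments typically combine in the textbook Q-Learning update (Watkins & Dayan, 1992); the sketch below assumes alpha_value and gamma_value play their standard roles as learning rate and discount factor, and is not necessarily the literal body of this method.

    def updated_q_value(q_value, reward_value, next_max_q, alpha_value=0.9, gamma_value=0.9):
        # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        return q_value + alpha_value * (reward_value + gamma_value * next_max_q - q_value)
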
update_state(state_key, action_key)[source]

Update state.

This method can be overridden for concrete use cases.

Parameters:
  • state_key – The key of state in self.t.
  • action_key – The key of action in self.t.
Returns:

The key of state in self.t+1.

visualize_learning_result(state_key)[source]

Visualize the learning result. This method should be overridden for concrete use cases.

This method is called in the last learning step.

Parameters:state_key – The key of state in self.t.

Module contents