Reinforcement Learning Queue Control

Reinforcement learning is characterized by the learning problem rather than by any specific method. A central challenge in reinforcement learning is balancing exploration and exploitation (Sutton & Barto, 2012). A reinforcement learning system has a policy, a reward function that defines short-term goals, a value function that captures long-term goals, and, optionally, a model of the environment (Sutton & Barto, 2012).

Queue Control System

The system under consideration controls queues of transactions. The system model has high-priority transactions that must be completed in "real time" because a human user is directly waiting for them; these arrive mostly during the day and tend to come in bursts. Another class of transactions can be done in the background; these are commonly referred to as batch transactions. There is a large supply of batch transactions to occupy idle periods.

For short-term and long-term goals there are a reward function and a value function, respectively (Sutton & Barto, 2012). The long-term goal is to maximize the overall throughput of the system, so the value function is the count of successfully completed transactions per day. The short-term goal is that "real-time" transactions must be completed within a short service-level agreement time frame without errors. In this sense, the reward is the absence of penalties for timeouts or errors.
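
As a minimal sketch of this bookkeeping in Python (the penalty values and the SLA threshold below are illustrative assumptions, not figures from the system):

# Reward and value bookkeeping (illustrative values)
SLA_SECONDS = 5.0  # assumed service-level agreement window for real-time work

def reward(is_realtime, elapsed_seconds, errored):
    # Reward is zero unless a penalty applies: an error, or a real-time timeout.
    penalty = 0.0
    if errored:
        penalty -= 1.0
    if is_realtime and elapsed_seconds > SLA_SECONDS:
        penalty -= 1.0
    return penalty

class DayValue:
    # Value function proxy: successfully completed transactions per day.
    def __init__(self):
        self.completed = 0
    def record(self, errored):
        if not errored:
            self.completed += 1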

What makes this problem difficult is the complexity of the environment which depends upon shared resources. For example, there is a local database and there are external databases. There is a random confounding factor in which the external databases have competing demands placed upon them by other systems. The transactions running through the system are not identical. Some transactions are rather complex and hit many external databases as well as the local database.

The part that needs optimization is the queue control agent's worker count. If too few workers are allocated, overall transaction times increase because a backlog builds up. If too many workers are created, overall transaction times also increase, this time due to contention (both external and internal). The number of workers must therefore change with the demands on the system as well as with the unknown demands placed upon the external databases.
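
A back-of-the-envelope estimate makes the trade-off concrete; this is the same quantity the pseudocode later in this post compares across worker counts (the assumption that s workers give roughly s-fold parallelism is optimistic, since contention inflates the mean time itself):

# Rough time to drain the current backlog with s workers.
# Contention shows up as a larger mean_time at higher s, which the agent learns.
def estimated_drain_time(queue_length, mean_time, s):
    return (queue_length * mean_time) / s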

A challenge in reinforcement learning is balancing exploration and exploitation. To obtain a reward, the system must exploit something it has already learned, but to obtain better rewards it must explore. In this problem, the system must at times "hire" another worker thread and attempt to run more transactions in order to determine whether additional bandwidth is available.
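
One simple way to implement that impulse is an epsilon-greedy style rule, sketched below; the exploration probability is an illustrative assumption, and the pseudocode later in this post uses a visit-count-based rule instead:

# Epsilon-greedy exploration over the worker count (illustrative).
import random

EXPLORE_PROB = 0.05  # assumed probability of exploring per decision

def maybe_explore(s, max_workers):
    if random.random() < EXPLORE_PROB:
        if random.random() < 0.5 and s < max_workers:
            return s + 1  # "hire" a worker to probe for spare bandwidth
        if s > 1:
            return s - 1  # "fire" a worker to probe for contention relief
    return s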

This becomes a dynamic optimization problem. What was optimal an hour ago might not be optimal now, given the other random processes that are not visible to the system; in other words, the external databases might be bogged down. Placing an undue load upon a database causes a nonlinear response: when the database is bogged down, execution time grows exponentially.
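
To see the shape of that blow-up, here is a toy single-server queueing approximation; real databases are far more complex, so treat this only as an illustration of the nonlinearity:

# Toy model: response time versus utilization for a single-server queue.
# As utilization approaches 1.0, response time explodes nonlinearly.
def approx_response_time(service_time, utilization):
    assert 0.0 <= utilization < 1.0
    return service_time / (1.0 - utilization)

for u in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(u, round(approx_response_time(0.1, u), 3))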

Temporal-Difference Learning

One way to formalize this is to use the temporal-difference learning methodology (Sutton & Barto, 2012), treating each worker count as a state. Each state would maintain running estimates, as suggested by Abbeel (n.d.), of transaction times and error rates. The learning rate for the two model types can be chosen so that the models slowly forget very old data, allowing them to adapt to new circumstances while still converging. The "value" function is the count of successful transactions completed per day, and the "reward" function is a penalty in this use case, as it is the count of errors.

# Logic Based on Temporal Differencing
Let V be a list of per-state statistics, indexed by worker count s;
each entry holds count, error, and meanTime (all initialized to zero).
Initialize s <- 1 # we will start with 1 worker
alpha <- 0.05 # learning rate
Repeat
  if both queues are empty then wait idle
  if real-time queue is not empty then
      t <- pull transaction from real-time queue
  else if batch queue is not empty then
      t <- pull transaction from batch queue
  time <- how long it took to perform transaction t
  V[s].count <- V[s].count + 1 # count - error is the "value function"
  if error then V[s].error <- V[s].error + 1 # errors are "anti-reward"
  V[s].meanTime <- (1.0 - alpha) * V[s].meanTime + alpha * time
  give back completed transaction
  # explore logic: occasionally visit a neighboring worker count,
  # with a probability that shrinks as that neighbor accumulates visits
  if s < max workers and random > (V[s+1].count+1)/(V[s+1].count+2) then s <- s + 1 # hire another worker
  else if s > 1 and random > (V[s-1].count+1)/(V[s-1].count+2) then s <- s - 1 # fire a worker
  # exploit logic (skip a comparison if the neighboring state has no observations yet)
  len <- length of real-time queue
  if( s > 1 # we do not want to fire all workers
      and (len * V[s].meanTime)/s > (len * V[s-1].meanTime)/(s-1) # will the timings improve?
      and V[s].error/V[s].count <= V[s-1].error/V[s-1].count ) # we must maintain low errors
  then s <- s - 1 # fire a worker
  else if( s < max workers
      and (len * V[s].meanTime)/s > (len * V[s+1].meanTime)/(s+1) # will the timings improve?
      and V[s].error/V[s].count <= V[s+1].error/V[s+1].count ) # we must maintain low errors
  then s <- s + 1 # hire another worker
  # otherwise keep the current worker count
until day is done
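
For readers who prefer runnable code, the following Python sketch implements the same per-state statistics and hire/fire decision; the MAX_WORKERS bound and the way the surrounding worker pool would call these functions are assumptions, not part of the pseudocode above:

# Python sketch of the same agent logic (environment hooks are assumed)
import random

MAX_WORKERS = 32   # assumed upper bound on worker threads
ALPHA = 0.05       # learning rate for the running mean

class StateStats:
    def __init__(self):
        self.count = 0
        self.error = 0
        self.mean_time = 0.0

V = [StateStats() for _ in range(MAX_WORKERS + 2)]  # index = worker count

def record(s, elapsed, errored):
    # Update the running statistics for the current worker count s.
    V[s].count += 1
    if errored:
        V[s].error += 1
    V[s].mean_time = (1.0 - ALPHA) * V[s].mean_time + ALPHA * elapsed

def drain_time(queue_len, s):
    return queue_len * V[s].mean_time / s

def error_rate(s):
    return V[s].error / V[s].count if V[s].count else 0.0

def next_worker_count(s, queue_len):
    # Explore: visit a neighbor with a probability that decays with its visit count.
    if s < MAX_WORKERS and random.random() > (V[s + 1].count + 1) / (V[s + 1].count + 2):
        return s + 1
    if s > 1 and random.random() > (V[s - 1].count + 1) / (V[s - 1].count + 2):
        return s - 1
    # Exploit: move toward a neighbor that looks faster without raising the error rate.
    if (s > 1 and V[s - 1].count > 0
            and drain_time(queue_len, s) > drain_time(queue_len, s - 1)
            and error_rate(s) <= error_rate(s - 1)):
        return s - 1
    if (s < MAX_WORKERS and V[s + 1].count > 0
            and drain_time(queue_len, s) > drain_time(queue_len, s + 1)
            and error_rate(s) <= error_rate(s + 1)):
        return s + 1
    return s

A worker pool would call record(...) each time a transaction completes under the current worker count and then call next_worker_count(...) to decide whether to resize.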

For more advanced logic with additional predictor variables, Arel (2015) hints that each state could contain an “online” Bayesian prediction model, which could be used for transaction failures. An “online” stochastic gradient-descent model for transaction times as suggested by Sutton and Barto (2012) could be utilized including additional factors such as hour of the day.
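
As one hedged illustration of those ideas, a per-state online model might look like the following, with a Beta-style success/failure counter for failure probability and a tiny stochastic gradient-descent step for a linear transaction-time model; the hour-of-day feature and the step size are illustrative assumptions:

class OnlineStateModel:
    # Per-state online estimators; all parameters here are illustrative assumptions.
    def __init__(self, lr=0.01):
        self.lr = lr
        # Beta-Bernoulli style counters for the probability of a failed transaction
        self.failures = 1.0
        self.successes = 1.0
        # linear model of transaction time: intercept plus an hour-of-day coefficient
        self.w = [0.0, 0.0]

    def failure_probability(self):
        return self.failures / (self.failures + self.successes)

    def observe_outcome(self, failed):
        if failed:
            self.failures += 1.0
        else:
            self.successes += 1.0

    def predict_time(self, hour):
        return self.w[0] + self.w[1] * hour

    def observe_time(self, hour, observed):
        # one stochastic gradient-descent step on squared prediction error
        err = self.predict_time(hour) - observed
        self.w[0] -= self.lr * err
        self.w[1] -= self.lr * err * hour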

Conclusion

Using the approaches outlined in the referenced readings, one could implement such a system, and the system would adapt to ongoing changes. It would be self-tuning and require no administration once installed. Reinforcement learning techniques offer an exciting way to assemble machine learning into a framework for solving problems that are ever changing.

References

Abbeel, P. (n.d.). CS 188: Artificial intelligence, reinforcement learning (RL). UC Berkeley. Retrieved from https://inst.eecs.berkeley.edu/~cs188/sp12/slides/cs188%20lecture%2010%20and%2011%20–%20reinforcement%20learning%202PP.pdf

Arel, I. (2015). Reinforcement learning in artificial intelligence. Retrieved from http://web.eecs.utk.edu/~itamar/courses/ECE-517/notes/lecture1.pdf

Sutton, R. S., & Barto, A. G. (2012). Reinforcement learning: An introduction. MIT Press.
