{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# CS640 Homework 5: HMM, MDP and RL\n", "\n", "This assignment can be divided into three parts, and each part contains some coding tasks:\n", "1. Hidden Markov Model\n", " - Complete a model\n", " - Implement the **Forward Procedure**, **Backward Procedure**, and the **Viterbi Algorithm**\n", "2. Markov Decision Process\n", " - Implement **value iteration** and **policy iteration**\n", "3. Reinforcelemnt Learning\n", " - Implement **Q-learning**\n", "\n", "Do **not** include any extra outputs in your final answer. If you use print() function for debugging, please run your code without them again before submitting.\n", "\n", "### Collaboration\n", "All questions must be answered **independently**.\n", "\n", "## Instructions\n", "\n", "### General Instructions\n", "In an ipython notebook, to run code in a cell or to render [Markdown](https://en.wikipedia.org/wiki/Markdown)+[LaTeX](https://en.wikipedia.org/wiki/LaTeX) press `Ctrl+Enter` or `[>|]`(like \"play\") button above. To edit any code or text cell (double) click on its content. To change cell type, choose \"Markdown\" or \"Code\" in the drop-down menu above.\n", "\n", "Most of the written questions are followed up a cell for you enter your answers. Please enter your answers in a new line below the **Answer** mark. If you do not see such cell, please insert one by yourself. Your answers and the questions should **not** be in the same cell.\n", "\n", "### Instructions on Math\n", "Some questions require you to enter math expressions. To enter your solutions, put down your derivations into the corresponding cells below using LaTeX. Show all steps when proving statements. If you are not familiar with LaTeX, you should look at some tutorials and at the examples listed below between \\$..\\$. The [OEIS website](https://oeis.org/wiki/List_of_LaTeX_mathematical_symbols) can also be helpful.\n", "\n", "Alternatively, you can scan your work from paper and insert the image(s) in a text cell.\n", "\n", "## Submission\n", "Once you are ready, save the note book as PDF file (File -> Print -> Save as PDF) and submit via Gradescope.\n", "\n", "Please select pages to match the questions on Gradescope. **You may be subject to a 5% penalty if you do not do so**." ], "metadata": { "id": "j9PamOLZsYzm" } }, { "cell_type": "markdown", "source": [ "## Q1: Hidden Markov Model\n", "\n", "In this problem, you need to first complete a model by filling some missing values according to the existing ones, and then implement and run **Forward Procedure**, **Backward Procedure**, and the **Viterbi Algorithm**." ], "metadata": { "id": "KhZkRDrRuJ4r" } }, { "cell_type": "markdown", "source": [ "### Problem Setup\n", "\n", "Consider an HMM with the following properties (unknown values are marked with \"?\").\n", "\n", "States: $S = \\{S_{1}, S_{2}, S_{3}, S_{4}\\}$\n", "\n", "Initial state probability distribution: $\\pi = \\{\\pi_{1} = 0.3, \\pi_{2} = 0.1, \\pi_{3} = ?, \\pi_{4} = 0.5\\}$\n", "\n", "Transition probability matrix A, where each entry $a_{ij}$ is the probability of moving from state $S_{i}$ to state $S_{j}$:\n", "\n", "$$\\begin{bmatrix}\n", "? & 0.5 & 0.4 & 0\\\\\n", "? & 0.3 & 0.4 & 0.1\\\\\n", "? & 0.1 & 0.2 & 0.7 \\\\\n", "? 
& 0.2 & 0.2 & 0.2\n", "\\end{bmatrix}$$\n", "\n", "Symbols probability matrix $B$, where each entry $b_{jk} = b_{j}(k) = P[V_{k} \\; | \\; S_{j}]$ is the probability of yielding $V_{k}$ in state $S_{j}$:\n", "\n", "$$\\begin{bmatrix}\n", "0.5 & 0.4 & ? \\\\\n", "? & 0.3 & 0.1 \\\\\n", "0.8 & ? & 0.1 \\\\\n", "? & 0.3 & 0.7\n", "\\end{bmatrix}$$\n", "\n", "Let $\\lambda = (A, B, \\pi)$.\n", "\n", "Suppose we observe a sequence $O = V_{1}V_{3}V_{2}V_{1}$." ], "metadata": { "id": "N8OtTfM1_7_W" } }, { "cell_type": "markdown", "source": [ "### Q1.1\n", "\n", "Complete the model by filling the unknown values in the following code block and run the cell. Make sure the outputs are all **True**.\n", "\n" ], "metadata": { "id": "OyRdHZQY1joS" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "\n", "M = 3 # vocabulary size\n", "N = 4 # state number\n", "V = [None, 0, 1, 2] # one-based, V[0] = None to align the indices\n", "\n", "# fill in the blank values for the following three arrays\n", "pi = np.array([0.3, 0.1, , 0.5])\n", "A = np.array([[, 0.5, 0.4, 0], [, 0.3, 0.4, 0.1], [, 0.1, 0.2, 0.7], [, 0.2, 0.2, 0.2]])\n", "B = np.array([[0.5, 0.4, ], [, 0.3, 0.1], [0.8, , 0.1], [, 0.3, 0.7]])\n", "\n", "O = [V[1], V[3], V[2], V[1]]\n", "\n", "print(\"Sum of pi is one? \" + str(np.linalg.norm(pi.sum() - 1) < 1e-9))\n", "print(\"Sum of each row of A is one? \" + str(np.linalg.norm(A.sum(axis = 1) - 1) < 1e-9))\n", "print(\"Sum of each row of B is one? \" + str(np.linalg.norm(B.sum(axis = 1) - 1) < 1e-9))" ], "metadata": { "id": "f_DnIjNs3uIs" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Q1.2\n", "\n", "What is $P(O | \\lambda)$? Answer the question by finishing implementing the **Forward Procedure** and the **Backward Procedure** below and run the cells.\n", "\n", "***Hints***\n", "- The two procedures should yield the same final answer.\n", "- The forward procedure code prints the $\\alpha$ values as a table while the backward procedure code prints the $\\beta$ values as a table. Think of how the values are supposed to change over the steps/rows." 
], "metadata": { "id": "zopdN3Jh2TXU" } }, { "cell_type": "markdown", "source": [ "### **Forward Procedure**" ], "metadata": { "id": "1554Lh6bBRvs" } }, { "cell_type": "code", "source": [ "T = len(O)\n", "alpha = np.zeros((T, N))\n", "\n", "# initialization\n", "for i in range(N): # complete this loop\n", " ################### start of your code ###################\n", "\n", " #################### end of your code ####################\n", "\n", "# inductive steps\n", "for t in range(1, T): # complete this loop\n", " ################### start of your code ###################\n", "\n", " #################### end of your code ####################\n", "print(\"alpha values:\")\n", "print(alpha)\n", "print()\n", "\n", "# final answer\n", "################### start of your code ###################\n", "answer = 0 # this variable should store your final answer\n", "\n", "#################### end of your code ####################\n", "print(\"P(O | lambda) = \" + str(answer))" ], "metadata": { "id": "FTmDSQfL3qQJ" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### **Backward Procedure**" ], "metadata": { "id": "lJ3Zo1buAywC" } }, { "cell_type": "code", "source": [ "T = len(O)\n", "beta = np.zeros((T, N))\n", "\n", "# initialization\n", "for i in range(N): # complete this loop\n", " ################### start of your code ###################\n", "\n", " #################### end of your code ####################\n", "\n", "# inductive steps\n", "for t in range(T - 2, -1, -1): # complete this loop\n", " ################### start of your code ###################\n", "\n", " #################### end of your code ####################\n", "print(\"beta values:\")\n", "print(beta)\n", "print()\n", "\n", "# final answer\n", "# complete this part\n", "################### start of your code ###################\n", "answer = 0 # this variable should store your final answer\n", "\n", "#################### end of your code ####################\n", "print(\"P(O | lambda) = \" + str(answer))" ], "metadata": { "id": "WQSZrXPyBh-i" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Q1.3\n", "\n", "What is the most likely state sequence for the observed output sequence $O = V_{1}V_{3}V_{2}V_{1}$? Answer the question by finish implementing the **Viterbi Algorithm** in the following code block and run the cell." 
], "metadata": { "id": "GVbOfDhB2r28" } }, { "cell_type": "code", "source": [ "T = len(O)\n", "delta = np.zeros((T, N)) # stores the optimal probabilities\n", "psi = np.zeros((T, N), dtype = np.int8) # stores the corresponding sources\n", "\n", "# initialization\n", "for i in range(N): # complete this loop\n", " ################### start of your code ###################\n", "\n", " #################### end of your code ####################\n", "\n", "# inductive steps\n", "for t in range(1, T): # complete this loop\n", " ################### start of your code ###################\n", "\n", " #################### end of your code ####################\n", "print(\"delta values:\")\n", "print(delta)\n", "print()\n", "print(\"psi values:\")\n", "print(psi)\n", "print()\n", "\n", "# backtrack for the optimal path\n", "################### start of your code ###################\n", "optimal_path = [] # this variable should store your final answer\n", "\n", "################### start of your code ###################\n", "print(\"Optimal path = \" + str(optimal_path))" ], "metadata": { "id": "Vg63fbI3ifGV" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Q2: Markov Decision Process\n", "\n", "This problem asks you to implement both the **value iteration** and **policy iteration**. Two simple test examples are provided for you to debug your code, and at the end some more complicated cases are randomly generated to test your answers.\n", "\n" ], "metadata": { "id": "oWa_SVpT_V0t" } }, { "cell_type": "markdown", "source": [ "### Simple Test Examples\n", "\n", "The next block contains both the packages you will need and two simple test cases. Please run the cell before working on the implementation.\n", "\n", "Note that the imported packages are sufficient for this problem. You can add some more if you think they can help, but obviously, you should **not** use any MDP packages." 
], "metadata": { "id": "UIBCeGvhAyAk" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "import sys\n", "from itertools import product\n", "import matplotlib.pyplot as plt\n", "\n", "# a small MDP\n", "states = [0, 1, 2]\n", "actions = [0, 1] # 0 : stay, 1 : jump\n", "jump_probabilities = np.matrix([[0.1, 0.2, 0.7],\n", " [0.5, 0, 0.5],\n", " [0.6, 0.4, 0]])\n", "for i in range(len(states)):\n", " jump_probabilities[i, :] /= jump_probabilities[i, :].sum()\n", "\n", "rewards_stay = np.array([0, 8, 5])\n", "rewards_jump = np.matrix([[-5, 5, 7],\n", " [2, -4, 0],\n", " [-3, 3, -3]])\n", "\n", "T = np.zeros((len(states), len(actions), len(states)))\n", "R = np.zeros((len(states), len(actions), len(states)))\n", "for s in states:\n", " T[s, 0, s], R[s, 0, s] = 1, rewards_stay[s]\n", " T[s, 1, :], R[s, 1, :] = jump_probabilities[s, :], rewards_jump[s, :]\n", "\n", "example_1 = (states, actions, T, R)\n", "\n", "# a larger MDP\n", "states = [0, 1, 2, 3, 4, 5, 6, 7]\n", "actions = [0, 1, 2, 3, 4]\n", "T = np.zeros((len(states), len(actions), len(states)))\n", "R = np.zeros((len(states), len(actions), len(states)))\n", "for a in actions:\n", " T[:, a, :] = np.random.RandomState(4).uniform(0, 10, (len(states), len(states)))\n", "\n", " # randomly delete 20% of the edges\n", " tuples = list(product(states, actions, states))\n", " np.random.RandomState(6).shuffle(tuples)\n", " for t in tuples[:len(tuples) // 5]:\n", " T[t[0], t[1], t[2]] = 0\n", "\n", " # normalizing\n", " for s in states:\n", " T[s, a, :] /= T[s, a, :].sum()\n", " R[:, a, :] = np.random.RandomState(8).uniform(-10, 10, (len(states), len(states)))\n", "\n", "example_2 = (states, actions, T, R)" ], "metadata": { "id": "PGRB2VxtBpi7" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Q2.1: Value Iteration\n", "\n", "Implement value iteration by finishing the following function, and then run the cell.\n", "\n", "***Hint***\n", "\n", "- The provided code produces a figure of state values for each of the two test cases. Think of how the values are supposed to change over time steps." 
], "metadata": { "id": "MgGAZYifBv9S" } }, { "cell_type": "code", "source": [ "def value_iteration(states, actions, T, R, gamma = 0.95, tolerance = 1e-2, max_steps = 1000):\n", " Vs = [] # all state values\n", " Vs.append(np.zeros(len(states))) # initial state values\n", " steps, convergent = 0, False\n", " while not convergent and steps < max_steps: # complete this loop\n", " ########################### start of your code #########################\n", " # compute new state values as an array, and append the array to the list Vs)\n", "\n", " ############################ end of your code ##########################\n", " convergent = np.linalg.norm(Vs[-1] - Vs[-2]) < tolerance\n", " steps += 1\n", " if steps == max_steps:\n", " print(\"Max iterations reached before convergence.\")\n", " sys.exit(1)\n", " ########################### start of your code #########################\n", " # compute Q from the last V array\n", " # extract the optimal policy and name it \"policy\" to return\n", "\n", " policy = # store your optimal policy here\n", " ############################ end of your code ##########################\n", " return np.array(Vs), policy, steps\n", "\n", "states, actions, T, R = example_1\n", "gamma, tolerance, max_steps = 0.95, 1e-2, 1000\n", "Vs_1, policy_1, steps_1 = value_iteration(states, actions, T, R, gamma, tolerance, max_steps)\n", "\n", "states, actions, T, R = example_2\n", "gamma, tolerance, max_steps = 0.95, 1e-2, 1000\n", "Vs_2, policy_2, steps_2 = value_iteration(states, actions, T, R, gamma, tolerance, max_steps)\n", "\n", "fig, axes = plt.subplots(1, 2, figsize = (8, 4))\n", "for i in range(Vs_1.shape[1]):\n", " axes[0].plot(Vs_1[:, i], label = \"state \" + str(i))\n", "axes[0].set_title(\"Example 1 State Values\", fontsize = 12)\n", "axes[0].set_xlabel(\"Steps\", fontsize = 8)\n", "axes[0].legend()\n", "axes[0].grid(True)\n", "for i in range(Vs_2.shape[1]):\n", " axes[1].plot(Vs_2[:, i], label = \"state \" + str(i))\n", "axes[1].set_title(\"Example 2 State Values\", fontsize = 12)\n", "axes[1].set_xlabel(\"Steps\", fontsize = 8)\n", "axes[1].legend()\n", "axes[1].grid(True)\n", "fig.tight_layout()\n", "plt.show()\n", "\n", "print(\"Optimal policy for example 1: \" + str(policy_1))\n", "print(\"Optimal policy for example 2: \" + str(policy_2))" ], "metadata": { "id": "60ITjogUB9vI" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Q2.2: Policy Iteration\n", "\n", "Implement policy iteration by finishing the following function, and then run the cell.\n", "\n", "***Hints***\n", "\n", "- The provided code prints step-wise state values and policies, again think of how the state values are supposed to change.\n", "- Note that the optimal policy from the two iteration procedures should match." ], "metadata": { "id": "sqq_nLRVB-de" } }, { "cell_type": "code", "source": [ "def policy_iteration(states, actions, T, R, gamma = 0.95, tolerance = 1e-2, max_steps = 1000):\n", " policy_list = [] # all policies explored\n", " initial_policy = np.array([np.random.choice(actions) for s in states]) # random policy\n", " policy_list.append(initial_policy)\n", " Vs = [] # all state values\n", " Vs = [np.zeros(len(states))] # initial state values\n", " steps, convergent = 0, False\n", " while not convergent and steps < max_steps:\n", " ########################### start of your code #########################\n", " # 1. Evaluate the current policy, and append the state values to the list Vs\n", "\n", " # 2. 
Extract the new policy, and append it to the list policy_list\n", "\n", " policy_list.append() # append the new policy as an array here\n", " ############################ end of your code ##########################\n", " steps += 1\n", " convergent = (policy_list[-1] == policy_list[-2]).all()\n", " if steps == max_steps:\n", " print(\"Max iterations reached before convergence.\")\n", " sys.exit(1)\n", " return Vs, policy_list, steps\n", "\n", "print(\"Example MDP 1\")\n", "states, actions, T, R = example_1\n", "gamma, tolerance, max_steps = 0.95, 1e-2, 1000\n", "Vs, policy_list, steps = policy_iteration(states, actions, T, R, gamma, tolerance, max_steps)\n", "optimal_policy_1 = policy_list[-1]\n", "for i in range(steps):\n", " print(\"Step \" + str(i))\n", " print(\"state values: \" + str(Vs[i]))\n", " print(\"policy: \" + str(policy_list[i]))\n", " print()\n", "print()\n", "print(\"Example MDP 2\")\n", "states, actions, T, R = example_2\n", "gamma, tolerance, max_steps = 0.95, 1e-2, 1000\n", "Vs, policy_list, steps = policy_iteration(states, actions, T, R, gamma, tolerance, max_steps)\n", "for i in range(steps):\n", " print(\"Step \" + str(i))\n", " print(\"state values: \" + str(Vs[i]))\n", " print(\"policy: \" + str(policy_list[i]))\n", " print()\n", "optimal_policy_2 = policy_list[-1]\n", "\n", "print(\"Optimal policy for example 1: \" + str(optimal_policy_1))\n", "print(\"Optimal policy for example 2: \" + str(optimal_policy_2))" ], "metadata": { "id": "x90dWm0iCBPD" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Q2.3: More Tests\n", "\n", "The following block tests both of your implementations with even more random MDPs. Simply run the cell.\n", "\n", "***Hint***\n", "- The code produces step counts for the two iteration procedures. Think about what they should look like." ], "metadata": { "id": "IJFuod7cCHq3" } }, { "cell_type": "code", "source": [ "steps_list_vi, steps_list_pi = [], []\n", "for i in range(20):\n", " states = [j for j in range(np.random.randint(5, 40))]\n", " actions = [j for j in range(np.random.randint(2, states[-1]))]\n", " T = np.zeros((len(states), len(actions), len(states)))\n", " R = np.zeros((len(states), len(actions), len(states)))\n", " for a in actions:\n", " T[:, a, :] = np.random.uniform(0, 10, (len(states), len(states)))\n", "\n", " # randomly delete a fraction of the edges\n", " tuples = list(product(states, actions, states))\n", " np.random.shuffle(tuples)\n", " for t in tuples[:len(tuples) // np.random.randint(2, 20)]:\n", " T[t[0], t[1], t[2]] = 0\n", "\n", " for s in states:\n", " T[s, a, :] /= T[s, a, :].sum()\n", " R[:, a, :] = np.random.uniform(-10, 10, (len(states), len(states)))\n", " Vs, policy, steps_v = value_iteration(states, actions, T, R)\n", " Vs, policy_list, steps_p = policy_iteration(states, actions, T, R)\n", " steps_list_vi.append(steps_v)\n", " steps_list_pi.append(steps_p)\n", "print(\"Numbers of steps in value iteration: \" + str(steps_list_vi))\n", "print(\"Numbers of steps in policy iteration: \" + str(steps_list_pi))" ], "metadata": { "id": "3P_pQn7LCQaz" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Q3: Reinforcement Learning\n", "\n", "In this problem, you will be asked to implement the **$\epsilon$-greedy Q-learning** algorithm. But first, you will need to set up the environment with the *gym* package."
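, "\n", "\n", "As a reminder, tabular Q-learning updates its estimate after each transition $(s, a, r, s')$ with learning rate $\alpha$ and discount factor $\gamma$:\n", "\n", "$$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big),$$\n", "\n", "while the $\epsilon$-greedy behavior policy takes a random action with probability $\epsilon$ and $\arg\max_{a} Q(s, a)$ otherwise."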
], "metadata": { "id": "DZnUHybAFeEA" } }, { "cell_type": "markdown", "source": [ "**Installation**\n", "\n", "If you have not done this yet, install gym with atari using the following command." ], "metadata": { "id": "q-t2TRwaGxik" } }, { "cell_type": "code", "source": [ "!pip install cmake 'gym[atari]'\n", "!pip install gym[toy_text]" ], "metadata": { "id": "Za5rOdLEHA72" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Visualization on Google-Colab**\n", "\n", "This part is necessary for viewing the environment on Google-Colab. There may be other ways if you run this notebook on your own machine.\n", "\n", "If you choose to work on Google-Colab, run the following two cells and ignore errors if you see any." ], "metadata": { "id": "wQk-Odv1HCvn" } }, { "cell_type": "code", "source": [ "!apt-get install -y xvfb python-opengl > /dev/null 2>&1\n", "!pip install gym pyvirtualdisplay > /dev/null 2>&1\n", "!apt install chromium-browser xvfb" ], "metadata": { "id": "cz3RL4OjHErf", "collapsed": true }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import matplotlib.pyplot as plt\n", "from IPython import display as ipythondisplay\n", "from pyvirtualdisplay import Display\n", "display = Display(visible = 0, size = (400, 300))\n", "display.start()" ], "metadata": { "id": "hWhEw9blHqLJ" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Environment**\n", "\n", "We will use the \"Taxi-v3\" environment for this task. In this environment, an agent attempts to pickup a customer and then drive to a specific location. The following block helps you understand a bit more about this environment. Feel free to modify anything inside.\n", "\n", "It is strongly recommended that you read the description of this environment [here](https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py)." ], "metadata": { "id": "yLnQHuvkHuRl" } }, { "cell_type": "code", "source": [ "import matplotlib.pyplot as plt\n", "import gym\n", "\n", "env = gym.make(\"Taxi-v3\", new_step_api = True, render_mode = \"rgb_array\").env\n", "\n", "env.reset()\n", "\n", "prev_screen = env.render()[0]\n", "plt.imshow(prev_screen)" ], "metadata": { "id": "x42DZK5FH8Mp" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "As described in the source code, there are 500 states and 6 actions in this environment.\n", "\n", "Initially, the passenger and the destination can only spawn at two distinct color tiles.\n", "\n", "Each state is encoded with the following information: taxi coordinate, passenger locations, and destination location.\n", "\n", "Passenger locations are:\n", "- 0: R(ed)\n", "- 1: G(reen)\n", "- 2: Y(ellow)\n", "- 3: B(lue)\n", "- 4: in taxi\n", "\n", "Destination locations are:\n", "- 0: R(ed)\n", "- 1: G(reen)\n", "- 2: Y(ellow)\n", "- 3: B(lue)\n", "\n", "Actions are:\n", "- 0: move south\n", "- 1: move north\n", "- 2: move east\n", "- 3: move west\n", "- 4: pickup passenger\n", "- 5: drop off passenger\n", "\n", "Rewards are:\n", "- -1 per step unless other reward is triggered\n", "- +20 delivering passenger\n", "- -10 executing \"pickup\" and \"drop-off\" actions illegally\n", "\n", "The environment includes a dictionary P storing the reward for each (state, action) pair. The information stored in P has the following structure: {state: {action: [(probability, nextstate, reward, terminated)]}}. The following code the information in the 10th state." 
], "metadata": { "id": "s0AFMfMyICou" } }, { "cell_type": "code", "source": [ "env.P[10] # shows the rewards in the 10th state" ], "metadata": { "id": "VlnaGHDmIJmf" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Note that all probabilities are 1 in this environment." ], "metadata": { "id": "HiDpeGXWINWB" } }, { "cell_type": "markdown", "source": [ "### Q2.1: $\\epsilon$-greedy Q-learning\n", "\n", "**Implementation**\n", "\n", "Implement **$\\epsilon$-greedy Q-learning** by modifying the specified part in the following block. Currently it is only taking a random action." ], "metadata": { "id": "rgCXvFP2ISze" } }, { "cell_type": "code", "source": [ "def Q_learning(env, episodes = 100000, alpha = 0.1, gamma = 0.6, epsilon = 0.1):\n", " Q_values = np.zeros([env.observation_space.n, env.action_space.n])\n", " rewards_list = [0] * episodes\n", " for episode in trange(episodes, desc = \"Training progress\"):\n", " state = env.reset()\n", " terminated = False\n", " while not terminated:\n", " ########################################################################\n", " # TO DO: implement Q-learning update\n", " action = env.action_space.sample()\n", " next_state, reward, terminated, _, info = env.step(action) # you shouldn't change this line\n", " ############################# End of your code #########################\n", " rewards_list[episode] += reward\n", " state = next_state\n", " return Q_values, rewards_list" ], "metadata": { "id": "5QXWB3VLIycf" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Training and Evaluating**\n", "\n", "Run the following block to train and evaluate your implementation.\n", "\n", "The training episode number is large on purpose: using a small value may cause the agent to stuck during test time. But since a large episode value means longer training time, during debugging, you can **create a separate cell** and call the function with a smaller value. Furthermore, you are free to experiment with other hyper-parameters. For example, you can experiment with some more hyperparameter settings.\n", "\n", "**Before submitting, make sure that you have successfully finish running the following cell.**" ], "metadata": { "id": "lnssdeNvJc5y" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "import gym\n", "from tqdm import trange\n", "\n", "env = gym.make(\"Taxi-v3\", new_step_api = True).env\n", "env.reset()\n", "Q_values, rewards_list = Q_learning(env)\n", "\n", "episodes = 1000\n", "steps_count, failures_count = 0, 0\n", "\n", "for episode in trange(episodes, desc = \"Test progress\"):\n", " state = env.reset()\n", " terminated = False\n", " while not terminated:\n", " action = np.argmax(Q_values[state])\n", " state, reward, terminated, _, info = env.step(action)\n", " steps_count += 1\n", " failures_count += reward == -10\n", "\n", "print(\"Average steps per episode: \" + str(np.round(steps_count / episodes, 2))) # should be less than 20\n", "print(\"Number of failures: \" + str(failures_count)) # should be very very low\n", "\n", "# the curve should be overall increasing\n", "fig, ax = plt.subplots()\n", "ax.plot(rewards_list)\n", "ax.set_title(\"Training Reward Plot\")\n", "ax.set_xlabel(\"Episode\")\n", "ax.set_ylabel(\"Reward\")\n", "plt.show()\n" ], "metadata": { "id": "BJqoJOgHJgAL" }, "execution_count": null, "outputs": [] } ] }