{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "CUdxCOrTmzzq" }, "source": [ "# CS640 Homework 5: MDP, RL and Adversarial Search\n", "\n", "This assignment covers three topics shown in the title. There is a problem for each topic.\n", "\n", "In particular, the MDP problem asks you to implement both the value iteration and the policy iteration, the RL problem asks you to implement the Q-learning algorithm, and in the adversarial search problem, you need to run the min-max algorithm by hand.\n", "\n", "### Collaboration\n", "You must answer all questions **independently**, including the coding ones.\n", "\n", "## Instructions\n", "\n", "### General Instructions\n", "In an ipython notebook, to run code in a cell or to render [Markdown](https://en.wikipedia.org/wiki/Markdown)+[LaTeX](https://en.wikipedia.org/wiki/LaTeX) press `Ctrl+Enter` or `[>|]`(like \"play\") button above. To edit any code or text cell (double) click on its content. To change cell type, choose \"Markdown\" or \"Code\" in the drop-down menu above.\n", "\n", "Most of the written questions are followed up a cell for you enter your answers. Please enter your answers in a new line below the **Answer** mark. If you do not see such cell, please insert one by yourself. Your answers and the questions should **not** be in the same cell.\n", "\n", "### Instructions on Math\n", "Some questions require you to enter math expressions. To enter your solutions, put down your derivations into the corresponding cells below using LaTeX. Show all steps when proving statements. If you are not familiar with LaTeX, you should look at some tutorials and at the examples listed below between \\$..\\$. The [OEIS website](https://oeis.org/wiki/List_of_LaTeX_mathematical_symbols) can also be helpful.\n", "\n", "Alternatively, you can scan your work from paper and insert the image(s) in a text cell.\n", "\n", "## Submission\n", "Once you are ready, save the note book as PDF file (File -> Print -> Save as PDF) and submit via Gradescope.\n", "\n", "Please select pages to match the questions on Gradescope. **You may be subject to a 5% penalty if you do not do so**." ] }, { "cell_type": "markdown", "metadata": { "id": "Ra3x75Cq0TE9" }, "source": [ "##Q1: MDP##" ] }, { "cell_type": "markdown", "source": [ "###Q1.0: Setups" ], "metadata": { "id": "5va-te6ASat6" } }, { "cell_type": "markdown", "metadata": { "id": "YTCVccBl0bxu" }, "source": [ "**Packages**\n", "\n", "The package(s) imported in the following block should be sufficient for this task, but you are free to add more if necessary. However, keep in mind that you **should not** import and use any MDP package." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RIdUg8hh0dTm" }, "outputs": [], "source": [ "import numpy as np\n", "import sys\n", "from itertools import product\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "id": "XxPSV1lC92Is" }, "source": [ "**Examples for testing**\n", "\n", "The following block contains two examples used to test your code. You can create more for debugging, but please add it to a different block." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JUak0AUi90_w" }, "outputs": [], "source": [ "# a small MDP\n", "states = [0, 1, 2]\n", "actions = [0, 1] # 0 : stay, 1 : jump\n", "jump_probabilities = np.matrix([[0.1, 0.2, 0.7],\n", " [0.5, 0, 0.5],\n", " [0.6, 0.4, 0]])\n", "for i in range(len(states)):\n", " jump_probabilities[i, :] /= jump_probabilities[i, :].sum()\n", "\n", "rewards_stay = np.array([0, 8, 5])\n", "rewards_jump = np.matrix([[-5, 5, 7],\n", " [2, -4, 0],\n", " [-3, 3, -3]])\n", "\n", "T = np.zeros((len(states), len(actions), len(states)))\n", "R = np.zeros((len(states), len(actions), len(states)))\n", "for s in states:\n", " T[s, 0, s], R[s, 0, s] = 1, rewards_stay[s]\n", " T[s, 1, :], R[s, 1, :] = jump_probabilities[s, :], rewards_jump[s, :]\n", "\n", "example_1 = (states, actions, T, R)\n", "\n", "# a larger MDP\n", "states = [0, 1, 2, 3, 4, 5, 6, 7]\n", "actions = [0, 1, 2, 3, 4]\n", "T = np.zeros((len(states), len(actions), len(states)))\n", "R = np.zeros((len(states), len(actions), len(states)))\n", "for a in actions:\n", " T[:, a, :] = np.random.RandomState(4).uniform(0, 10, (len(states), len(states)))\n", "\n", " # randomly delete 20% of the edges\n", " tuples = list(product(states, actions, states))\n", " np.random.RandomState(6).shuffle(tuples)\n", " for t in tuples[:len(tuples) // 5]:\n", " T[t[0], t[1], t[2]] = 0\n", "\n", " # normalizing\n", " for s in states:\n", " T[s, a, :] /= T[s, a, :].sum()\n", " R[:, a, :] = np.random.RandomState(8).uniform(-10, 10, (len(states), len(states)))\n", "\n", "example_2 = (states, actions, T, R)" ] }, { "cell_type": "markdown", "metadata": { "id": "C61z5yUe_8u0" }, "source": [ "###Q1.1: Value iteration\n", "\n", "Implement value iteration by finishing the following function, and then run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yjxNKNZ6_9Gj" }, "outputs": [], "source": [ "def value_iteration(states, actions, T, R, gamma = 0.95, tolerance = 1e-2, max_steps = 1000):\n", " Vs = [] # all state values\n", " Vs.append(np.zeros(len(states))) # initial state values\n", " steps, convergent = 0, False\n", " while not convergent and steps < max_steps: # complete this loop\n", " ########################### start of your code #########################\n", " # compute new state values as an array, and append the array to the list Vs\n", "\n", " ############################ end of your code ##########################\n", " convergent = np.linalg.norm(Vs[-1] - Vs[-2]) < tolerance\n", " steps += 1\n", " if steps == max_steps:\n", " print(\"Max iterations reached before convergence.\")\n", " sys.exit(1)\n", " ########################### start of your code #########################\n", " # compute Q from the last V array\n", " # extract the optimal policy and name it \"policy\" to return\n", "\n", " ############################ end of your code ##########################\n", " return np.array(Vs), policy, steps\n", "\n", "states, actions, T, R = example_1\n", "gamma, tolerance, max_steps = 0.95, 1e-2, 1000\n", "Vs_1, policy_1, steps_1 = value_iteration(states, actions, T, R, gamma, tolerance, max_steps)\n", "\n", "states, actions, T, R = example_2\n", "gamma, tolerance, max_steps = 0.95, 1e-2, 1000\n", "Vs_2, policy_2, steps_2 = value_iteration(states, actions, T, R, gamma, tolerance, max_steps)\n", "\n", "fig, axes = plt.subplots(2, 1, figsize = (6, 8))\n", "for i in range(Vs_1.shape[1]):\n", " axes[0].plot(Vs_1[:, i], label = \"state \" + str(i))\n", "axes[0].set_title(\"Example 1 State Values\", fontsize = 12)\n", "axes[0].set_xlabel(\"Step\", fontsize = 8)\n", "axes[0].legend()\n", "axes[0].grid(True)\n", "for i in range(Vs_2.shape[1]):\n", " axes[1].plot(Vs_2[:, i], label = \"state \" + str(i))\n", "axes[1].set_title(\"Example 2 State Values\", fontsize = 12)\n", "axes[1].set_xlabel(\"Step\", fontsize = 8)\n", "axes[1].legend()\n", "axes[1].grid(True)\n", "fig.tight_layout()\n", "plt.show()\n", "\n", "print(\"Optimal policy for example 1: \" + str(policy_1))\n", "print(\"Optimal policy for example 2: \" + str(policy_2))" ] }, { "cell_type": "markdown", "metadata": { "id": "_fKPsi2UBw8n" }, "source": [ "###Q1.2: Policy iteration\n", "\n", "Implement policy iteration by finishing the following function, and then run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c7byfxgyBye1" }, "outputs": [], "source": [ "def policy_iteration(states, actions, T, R, gamma = 0.95, tolerance = 1e-2, max_steps = 1000):\n", " policy_list = [] # all policies explored\n", " initial_policy = np.array([np.random.choice(actions) for s in states]) # random policy\n", " policy_list.append(initial_policy)\n", " Vs = [] # all state values\n", " Vs = [np.zeros(len(states))] # initial state values\n", " steps, convergent = 0, False\n", " while not convergent and steps < max_steps:\n", " ########################### start of your code #########################\n", " # 1. Evaluate the current policy, and append the new state values as a numpy array to the list Vs\n", " # You can define a function outside of policy_iteration if you want\n", "\n", " # 2. Extract the new policy, and append the new policy as a numpy array to the list policy_list\n", "\n", " ############################ end of your code ##########################\n", " steps += 1\n", " convergent = (policy_list[-1] == policy_list[-2]).all()\n", " if steps == max_steps:\n", " print(\"Max iterations reached before convergence.\")\n", " sys.exit(1)\n", " return Vs, policy_list, steps\n", "\n", "print(\"Example MDP 1\")\n", "states, actions, T, R = example_1\n", "gamma, tolerance, max_steps = 0.95, 1e-2, 1000\n", "Vs, policy_list, steps = policy_iteration(states, actions, T, R, gamma, tolerance, max_steps)\n", "for i in range(steps):\n", " print(\"Step \" + str(i))\n", " print(\"state values: \" + str(Vs[i]))\n", " print(\"policy: \" + str(policy_list[i]))\n", " print()\n", "print()\n", "print(\"Example MDP 2\")\n", "states, actions, T, R = example_2\n", "gamma, tolerance, max_steps = 0.95, 1e-2, 1000\n", "Vs, policy_list, steps = policy_iteration(states, actions, T, R, gamma, tolerance, max_steps)\n", "for i in range(steps):\n", " print(\"Step \" + str(i))\n", " print(\"state values: \" + str(Vs[i]))\n", " print(\"policy: \" + str(policy_list[i]))\n", " print()" ] }, { "cell_type": "markdown", "metadata": { "id": "rvQqa18rDnwr" }, "source": [ "##Q1.3: More tests\n", "\n", "The following block tests both of your implementations with even more random MDPs. Simply run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YS1Wr2APDpu2" }, "outputs": [], "source": [ "steps_list_vi, steps_list_pi = [], []\n", "for i in range(20):\n", " states = [j for j in range(np.random.randint(5, 40))]\n", " actions = [j for j in range(np.random.randint(2, states[-1]))]\n", " T = np.zeros((len(states), len(actions), len(states)))\n", " R = np.zeros((len(states), len(actions), len(states)))\n", " for a in actions:\n", " T[:, a, :] = np.random.uniform(0, 10, (len(states), len(states)))\n", "\n", " # randomly delete 20% of the edges\n", " tuples = list(product(states, actions, states))\n", " np.random.shuffle(tuples)\n", " for t in tuples[:len(tuples) // np.random.randint(2, 20)]:\n", " T[t[0], t[1], t[2]] = 0\n", "\n", " for s in states:\n", " T[s, a, :] /= T[s, a, :].sum()\n", " R[:, a, :] = np.random.uniform(-10, 10, (len(states), len(states)))\n", " Vs, policy, steps_v = value_iteration(states, actions, T, R)\n", " Vs, policy_list, steps_p = policy_iteration(states, actions, T, R)\n", " steps_list_vi.append(steps_v)\n", " steps_list_pi.append(steps_p)\n", "print(\"Numbers of steps in value iteration: \" + str(steps_list_vi))\n", "print(\"Numbers of steps in policy iteration: \" + str(steps_list_pi))" ] }, { "cell_type": "markdown", "metadata": { "id": "gzXO3gCjDtaB" }, "source": [ "##Q2: RL" ] }, { "cell_type": "markdown", "source": [ "###Q2.0: Setups" ], "metadata": { "id": "_7cG1EA9TIcP" } }, { "cell_type": "markdown", "metadata": { "id": "Nkii3z1sDvYX" }, "source": [ "**Install gym**\n", "\n", "First, if you have not done this yet, install gym with atari using the following command." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PBwcZiTKDxLo" }, "outputs": [], "source": [ "!pip install cmake 'gym[atari]'\n", "!pip install gym[toy_text]" ] }, { "cell_type": "markdown", "metadata": { "id": "Htahh7JSDyyc" }, "source": [ "**Setup visualization**\n", "\n", "The following commands are necessary for viewing the environment on Google Colab. There may be other ways if you run this notebook on your own machine." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-vnWNn6UD0H0" }, "outputs": [], "source": [ "!apt-get install -y xvfb python-opengl > /dev/null 2>&1\n", "!pip install gym pyvirtualdisplay > /dev/null 2>&1" ] }, { "cell_type": "markdown", "metadata": { "id": "Ce29LoLPD4C7" }, "source": [ "**Packages**\n", "\n", "Again, the package(s) imported in the following block should be sufficient for this task, but you are free to add more if necessary. However, keep in mind that you **should not** import and use any RL package." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "noAJSYWCD-IH" }, "outputs": [], "source": [ "import numpy as np\n", "import scipy as sp\n", "import sys\n", "import gym" ] }, { "cell_type": "markdown", "metadata": { "id": "CFlYjJqMD8A9" }, "source": [ "The following block sets up visualization on Google Colab. If you see errors, try ignoring them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Gi-TpT79D7GJ" }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from IPython import display as ipythondisplay\n", "from pyvirtualdisplay import Display\n", "display = Display(visible = 0, size = (400, 300))\n", "display.start()" ] }, { "cell_type": "markdown", "metadata": { "id": "7Mx825ecEB2E" }, "source": [ "**Select environment**\n", "\n", "We will use the \"Taxi-v3\" environment for this task. In this environment, an agent attempts to pickup a customer and then drive to a specific location. The following block helps you understand a bit more about this environment. Feel free to modify anything inside.\n", "\n", "It is strongly recommended that you read the description of this environment [here](https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MQG4k638EEZ4" }, "outputs": [], "source": [ "env = gym.make(\"Taxi-v3\", new_step_api = True).env\n", "\n", "env.reset()\n", "\n", "# if you run on Google Colab\n", "prev_screen = env.render(mode = 'rgb_array')\n", "plt.imshow(prev_screen)\n", "\n", "# if you run on your own machine\n", "# env.render()" ] }, { "cell_type": "markdown", "metadata": { "id": "FrewZkQHEHWP" }, "source": [ "**Understand the gym environment**\n", "\n", "As described in the source code, there are 500 states and 6 actions in this environment.\n", "\n", "Initially, the passenger and the destination can only spawn at two distinct color tiles.\n", "\n", "Each state is encoded with the following information: taxi coordinate, passenger locations, and destination location.\n", "\n", "Passenger locations are:\n", "- 0: R(ed)\n", "- 1: G(reen)\n", "- 2: Y(ellow)\n", "- 3: B(lue)\n", "- 4: in taxi\n", "\n", "Destination locations are:\n", "- 0: R(ed)\n", "- 1: G(reen)\n", "- 2: Y(ellow)\n", "- 3: B(lue)\n", "\n", "Actions are:\n", "- 0: move south\n", "- 1: move north\n", "- 2: move east\n", "- 3: move west\n", "- 4: pickup passenger\n", "- 5: drop off passenger\n", "\n", "Rewards are:\n", "- -1 per step unless other reward is triggered\n", "- +20 delivering passenger\n", "- -10 executing \"pickup\" and \"drop-off\" actions illegally\n", "\n", "The environment includes a dictionary P storing the reward for each (state, action) pair. The information stored in P has the following structure: {state: {action: [(probability, nextstate, reward, done)]}}\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D5F_XsL6EJVD" }, "outputs": [], "source": [ "env.P[10] # shows the rewards in the 10th states" ] }, { "cell_type": "markdown", "metadata": { "id": "zr822QA2EK05" }, "source": [ "Note that all probabilities are 1 in this environment." ] }, { "cell_type": "markdown", "metadata": { "id": "YreIrwzsEMXu" }, "source": [ "###Q2.1: Q-learning" ] }, { "cell_type": "markdown", "source": [ "**Implement Q-learning**\n", "\n", "Implement Q-learning by modifying the specified part in the following block. Currently it is only taking a random action." ], "metadata": { "id": "ptec_LduUL2I" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "htdhwh7VEOO0" }, "outputs": [], "source": [ "def Q_learning(env, episodes = 100000, alpha = 0.1, gamma = 0.6, epsilon = 0.1):\n", " Q_values = np.zeros([env.observation_space.n, env.action_space.n])\n", " rewards_list = [0] * episodes\n", " for episode in range(episodes):\n", " state = env.reset()\n", " done = False\n", " while not done:\n", " ########################################################################\n", " # TO DO: implement Q-learning update\n", " action = env.action_space.sample()\n", " next_state, reward, done, info = env.step(action) # you shouldn't change this line\n", " ############################# End of your code #########################\n", " rewards_list[episode] += reward\n", " state = next_state\n", " return Q_values, rewards_list" ] }, { "cell_type": "markdown", "metadata": { "id": "OEfvkG3lEavI" }, "source": [ "**Train and evaluate**\n", "\n", "Run the following block to train and evaluate your implementation.\n", "\n", "You are free to write your own code to debug or to test. For example, you can experiment with some more hyperparameter settings. But please do so in a new code block and delete it before submitting. Furthermore, keep in mind that if the number of episodes of training is too small, your agent may not learn enough information and get stuck during testing forever." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xvSdBhtyEcal" }, "outputs": [], "source": [ "env.reset()\n", "Q_values, rewards_list = Q_learning(env)\n", "\n", "episodes = 1000\n", "steps_count, failures_count = 0, 0\n", "\n", "for episode in range(episodes):\n", " state = env.reset()\n", " done = False\n", " while not done:\n", " action = np.argmax(Q_values[state])\n", " state, reward, done, info = env.step(action)\n", " steps_count += 1\n", " failures_count += reward == -10\n", "\n", "print(\"Average steps per episode: \" + str(np.round(steps_count / episodes, 2))) # should be less than 20\n", "print(\"Number of failures: \" + str(failures_count)) # should be very very low\n", "\n", "# the curve should be overall increasing\n", "fig, ax = plt.subplots()\n", "ax.plot(rewards_list)\n", "ax.set_title(\"Training Reward Plot\")\n", "ax.set_xlabel(\"Episode\")\n", "ax.set_ylabel(\"Reward\")\n", "plt.show()" ] }, { "cell_type": "markdown", "source": [ "##Q 3: Adversarial Search##" ], "metadata": { "id": "avI0DjzH9bZM" } }, { "cell_type": "markdown", "source": [ "![AdversarialSearch.png]()" ], "metadata": { "id": "xQ0lhchv9wRI" } }, { "cell_type": "markdown", "source": [ "Consider an adversarial game with **three** players acting independently and the winner is unique. Suppose we have an AI agent using the minimax algorithm to play this game and it has produced a search tree shown in the figure above. In this search tree,\n", "\n", "1. the first level (node 0) is the **maximizing** level;\n", "2. the second level (node 1 and 2) is the ***minimizing*** level;\n", "3. the third level (node 3, 4, 5, and 6) is the ***minimizing*** level;\n", "4. the fourth level (node 7 to 14) is the **maximizing** level;\n", "5. the fifth level (node 15 to 30) is the leaf level and the number beneath each node is the static evaluator score of the corresponding game state." ], "metadata": { "id": "XHRXbUj29jLR" } }, { "cell_type": "markdown", "source": [ "### Q3.1\n", "\n", "Apply the minimax algorithm without pruning. Enter the returned value for each non-leaf node (node 0 to 14) below in the dictionary M and run the cell." ], "metadata": { "id": "hXngz_S_-Pak" } }, { "cell_type": "code", "source": [ "import pandas\n", "\n", "M = {0 : 0,\n", " 1 : 0,\n", " 2 : 0,\n", " 3 : 0,\n", " 4 : 0,\n", " 5 : 0,\n", " 6 : 0,\n", " 7 : 0,\n", " 8 : 0,\n", " 9 : 0,\n", " 10 : 0,\n", " 11 : 0,\n", " 12 : 0,\n", " 13 : 0,\n", " 14 : 0\n", "}\n", "\n", "df = pandas.DataFrame(M, index = [0])\n", "df" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "id": "J3mS0Da5-dOT", "outputId": "6c2a47e8-94b2-4cc3-a7e1-a16feee70589" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14\n", "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234567891011121314
0000000000000000
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 31 } ] }, { "cell_type": "markdown", "source": [ "## Q3.2\n", "\n", "Indicate the optimal path as a linked list of nodes (e.g., 0 - 1 - 3 - 7 - 15) the agent will choose." ], "metadata": { "id": "s_HL62ot_tf0" } }, { "cell_type": "markdown", "source": [ "**Answer**" ], "metadata": { "id": "mE3yBY8M_1oh" } }, { "cell_type": "markdown", "source": [ "## Q3.3\n", "\n", "Apply the minimax algorithm with alpha-beta pruning. List all of the nodes in ascending index order that will be pruned and run the cell." ], "metadata": { "id": "iQOLh-P4_5SZ" } }, { "cell_type": "code", "source": [ "answer = [] # enter your answer in this array\n", "answer.sort() # just for easier grading\n", "print(answer)" ], "metadata": { "id": "M9P6AA5k__qN" }, "execution_count": null, "outputs": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }