Boston University

CAS CS 460/660 - Introduction To Database Systems

PA2: Query Optimization in MySQL

Due on: Friday, December 1, 2017 at 11:59PM.

1. Introduction

In this assignment, you will carry out a number of exercises involving the optimization of relational queries using the MySQL query optimizer and the visualization command EXPLAIN. You need to read parts of the MySQL Documentation to be able to complete this assignment. To be specific, you need to get familiar with the EXPLAIN, SHOW PROFILE, ANALYZE and the INFORMATION_SCHEMA Tables command of MySQL (specific links are provided in the subsections).


2. Administrative stuff

This is a small hands-on project and must be done INDIVIDUALLY. Please read the entire assignment before beginning.

3. Setup

You will use the MySQL system that you used in the previous assignment. You need to create a new database and create some tables in this database. We provide the definitions of the tables and the data. You can download a zip file with the table shemata and the data from here. You can also download the files, one by one, from here.

Relation Schema:

We will use three tables in this experiment: part, supplier, partsupp, and lineitem.

·       part ( p_partkey integer, p_name varchar(55),  p_mfgr character(25), p_brand character(10), p_type varchar(25), p_size integer, p_container character(10), p_retailprice numeric(20,2), p_comment varchar(23), primary key (p_partkey));

·       supplier ( s_suppkey integer, s_name char(25), s_address varchar(40), s_nationkey integer, s_phone character(15), s_acctbal numeric(20,2), s_comment varchar(101), primary key (s_suppkey));

·       partsupp (ps_partkey integer, ps_suppkey integer, ps_availqty integer, ps_supplycost numeric(20,2), ps_comment varchar(199), primary key(ps_partkey, ps_suppkey));

·       lineitem( l_orderkey integer, l_partkey integer, l_suppkey integer, l_linenumber integer, l_quantity numeric(20,2), l_extendedprice numeric(20,2), l_discount numeric(3,2), l_tax numeric(3,2), l_returnflag character(1), l_linestatus character(1), l_shipdate date, l_commitdate date, l_receiptdate date, l_shipinstruct character(25), l_shipmode character(10), l_comment varchar(44), primary key (l_orderkey, l_linenumber);

 

First create a database and create the tables. You can use the create table statements in the schema.sql file.
Then,  exit mysql and login again to mysql using the following command:

 

>  mysql  --local-infile  -uroot  -p

 

Then, you can load the data using the load command from the files that you downloaded. To load the part table, you do:

 

mysql>  load data local infile 'path_in_your_laptop /part.tbl' into table part fields terminated by '|';

 

You need to replace “path_in_your_laptop” with the actual path that you store the data files.

 

You can see also here for more: http://dev.mysql.com/doc/refman/5.7/en/load-data.html

Useful Documentation:

4. Exercises

In general, use EXPLAIN FORMAT=JSON to get the evaluation plan because it gives much more information about the plan. Use the actual execution of the query on terminal or profile information for query execution times.
To use the Profiles you need to set the profile on first using: SET profiling = 1;

4.1 Statistics of the tables (15%)

We will first examine the statistics for table lineitem. Answer the following questions.

1.     How many records are there actually in “lineitem”? What is the estimated value by the query optimizer? How do you find these values (command or SQL)?

  1. Is the value used by the query optimizer exact? If not, why?

4.2 Index on perfect match query (25%)

We will check how index affects query optimization and performance.

Examine the following query:                

SELECT * FROM lineitem WHERE L_TAX = 0.07;

  1. What is the estimated total cost of executing the best plan? What does the cost of a plan mean in MySQL?
  2. What is the estimated result cardinality for this plan? How does the query optimizer obtain this value? Is it a reasonable one?
  3. Which access method (access path) does the optimizer choose?

Create an index “ltax_idx” on the attribute “L_TAX”.

  1. Which access method does the optimizer consider to be the best now? Is the estimated result cardinality better now? Why?
  2. Compare the two plans (without and with index). Explain briefly why access method in (4) is cheaper than the previous one.

4.3 Index on range select (20%)

Consider the following query:

SELECT * FROM lineitem WHERE L_QUANTITY < 45;

  1. How many tuples does the query optimizer think will be returned? What is the estimated total cost?
  2. What is the access method used?

Create an index “l_qty_idx” on the attribute “L_QUANTITY”. Consider now the following query:

SELECT * FROM lineitem WHERE L_QUANTITY < 3;

  1. What is the estimated total cost now? Is it correct? In what order would the tuples be returned by this plan?
  2. Explain why one of the access methods is more expensive than the other.

4.4 Join algorithm (20%)

Consider the following query:

SELECT DISTINCT (s_name)

FROM supplier, partsupp

WHERE s_suppkey = ps_suppkey AND ps_availqty < 40;

Answer the follow questions:

  1. Write down the best plan estimated by the optimizer (in plan tree form). What is the estimated total cost?
  2. What is the join algorithm used in the plan? Explain how the system reads the two relations (what access path uses).
  3. According to the optimizer, how many tuples will be retrieved from partsupp? How many from supplier? Are these estimations correct?
  4. Can you add an index to improve the performance of the plan? Which index you will create and on which attribute? What is the new plan that is executed and what is its cost?
  5. After you created the index, check the estimation of the tuples retrieved from partsupp. Is it correct? If yes, why?

4.5 Three-Way Join (20%)

Consider the following query:

SELECT p_name, s_name

FROM part, supplier, partsupp

WHERE s_suppkey = ps_suppkey AND p_partkey = ps_partkey;

 

  1. Write down the best plan estimated by the optimizer. List the joins and access methods it uses, and the order in which the relations are joined. What is the estimated cost of this plan? What is the actual query execution of this query.
  2. Can you add an index to improve the query performance? You are free to add any index that you like on any table. What is the new plan now? What is the new estimated cost? Is the new plan really better in practice than the previous one (does the actual query runs faster now)?
  3. Modify the query by adding a condition AND ps_availqty < 10 in the WHERE clause. What are the differences between this plan and the one in (1)? What is the advantage of choosing this plan over the previous one.

5. Submission instructions

You should turn in brief answers to questions by Friday, December 1, 2017 at 11:59PM. You can either return your answers in hard copy in the CS460 dropbox or create a file and use gsubmit. The plans should be written down in the tree form that we discussed in class.