39  Download (0)

Full text


DATABASE DESIGN II - 1DL400 Spring 2014

A course on modern database systems

Tore Risch

Uppsala Database Laboratory

Department of Information Technology, Uppsala University, Uppsala, Sweden


Introduction to Indexing

Elmasri/Navathe ch 16 and 17 Padron-McCarthy/Risch ch 21 and 22

Tore Risch

Uppsala Database Laboratory

Department of Information Technology, Uppsala University, Uppsala, Sweden


Tore Risch

Uppsala University, Sweden

•Physical Database Design:

E.g by indexes:

- Permit fast matching of records in table satisfying certain search conditions (predicates).

- Critical for scalable access


New applications may require data and index structures that are not supported by the DBMS.

E.g. calendars, numerical arrays, geographical data, text, etc.

Extensible DBMSs needed where user can plug-in own indexing and search algorithms

Database Design


Tore Risch

Uppsala University, Sweden

•DBMS is designed to handle very large amounts of data.

=> 10000 elements is considered small.

•DBMS should scale:

=> Performance should not degrade when the database grows.

•How to get scalability:

=> Need index structures that are maintained when database is updated.




• Files of records

– Operations on files

• Scans

– Operations on scans

• Ordered and unordered files

– Index sequential and hash files

• Index concepts

– Types of indexes – Clustering indexes

– Ordered indexes, B-trees, B+ trees – Unordered indexes, hash indexes

– Index properties and evaluation metrics


Files of records

A file managed by a DBMS is a sequence of records, where each record is a collection of data values representing a tuple.

A file descriptor (or file header) includes

– Meta-information that describes the file and the records it stores, such as the attribute names and data types of the fields in the records.

– The records in a file are usually uniform, i.e. they are of the same size and contain the same kind of data values in each position.

– File records are grouped into blocks of records. The file descriptor contains disk the addresses of the file blocks.

The blocking factor bfr (or block size) for a file is the (average) number of file records stored in a disk block.


Operations on disk files

• Typical file operations include:

– OPEN: Readies the file for access, and associates a pointer called a cursor that will refer to a current file record representing a current tuple.

– FIND: Searches for the first record in a file that satisfies a certain condition, and makes it the cursor position.

– FINDNEXT: Searches for the next record from the current cursor position that satisfies a certain condition, and makes it the current cursor position.

– READ: Reads the record in the current cursor position into program variables.

– DELETE: Removes the record at the current cursor position from the file, usually by marking the record to indicate that it is no longer valid.

– MODIFY: Changes the values of some fields in the record of the current

cursor position, i.e. copies values from program variables to the current record.

– INSERT: Inserts a new record into the file and makes it the current cursor position., i.e. copies values from program variables into a new record and inserts it at the current cursor position.

– CLOSE: Terminates access to the file.

– REORGANIZE: Reorganizes the file records.

• For example, the records marked deleted are physically removed from the file or batches of new records are merged into the file.


Streamed access to database operators

The result of internal database operators (e.g. a join) is usually represented as scans (~streams) of tuples

– The cursors represent positions in scans.

– Cursors can be moved iteratively forward over the scans until end-of-scan is reached

SQL queries are translated by the query optimizer into programs called execution plans.

Execution plans call physical relational algebra operators that are programs that iterate over scans of tuples and iteratively produce new scans of tuples as results.

• Scans can be defined over

– file records – index records

– Reverse scans over files and indexes are also possible where scans move backwards – records produced (emitted) by some physical relational algebra operator.


Streamed access to database operators

• SCAN operators

The following the three basic operators are defined over scans:

• OPEN: Opens the scan for reading tuples and sets the cursor to the first tuple.

• NEXT: reads the next tuple in the scan a into some program variables and makes it the current cursors position of the scan

• EOS: true is there are no more tuples in the scan

• CLOSE: Closes the scan and releases all its resources.

• Scans over physical files is represented by file pointers where a new read record is read for each NEXT call.

• Scans over the result of a physical operator is usually represented as iterator objects with next, eos, and close methods.

• Intermediate results may either be materialized as a list of blocks of tuples

or generated by some code (e.g. performing a join) when next is called


Scan-based database operators

• Examples of physical algebra operators (code) using and producing scans:

– FINDALL: emit all tuples in the result of a query. Such a scan operator call is e.g. produced by the query processor to iteratively emit the result of a query.

Execution plans usually contain many FINDALL operators.

– STOPAFTER n.: iteratively emit the first n tuples in another scan

– DISTINCT: iteratively emit the tuples of another scan where duplicate tuples are removed

– SORT: emit the tuples of another in sorted orders

– INDEXSCAN: Iterative emitting the tuples matching the key of an index

– REVESEINDEXSCAN: Iteratively emitting matching index tuples in reverse order

– MATERIALIZE: Store each tuple in a scan in a file.


Unordered files

Also called a heap file.

• New records are inserted at the end of the file.

A linear search through the file records is necessary to search for a record.

– This requires reading and searching half the file blocks on the average, and is hence does not scale as the file grows.

• Record insertion is quite efficient.

• Reading the records in order of a particular field requires sorting the file records, which does not scale.


Ordered files

Also called a index sequential (ISAM) file.

File records are kept sorted by the values of an ordering field.

• Insertion is rather expensive: records must be inserted in the correct order.

– It is common to keep a separate unordered overflow (or transaction) file for batching new records to improve insertion efficiency; this is periodically merged with the main ordered file.

A binary search can be used to search for a record on its ordering field value.

– This requires reading and searching log2 of the file blocks on the average, an improvement over linear search.

• Reading the records in order of the ordering field is quite efficient.

• Efficiency is measured in # of read disk blocks.


Hash files

Hashing for disk files is called External Hashing

The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, ..., bucketM-1.

– Typically, a bucket corresponds to one disk block in a file.

– Each bucket represents one or several records (i.e. tuples)

One of the fields of the records is designated to be the hash key of the file.

• The record with hash key value K is stored in bucket i, where i = h(K), and h is the hashing function.

• Search and update is very efficient on equality of the hash key.

Collisions occur when a new record hashes to a bucket that is already full.

– An overflow file is kept for storing such records.

– Overflow records that hash to each bucket can be linked together.

Main disadvantages of static external hashing:

– Fixed number of buckets M is a problem if the number of records in the file grows or shrinks.

– Ordered access on the hash key is quite inefficient (requires sorting the records).


Hash files (contd.)


Hash files - overflow handling

Methods for collision resolution include chaining:

– Overflow locations are kept, usually by extending the bucket with overflow positions.

– In addition, a pointer to a chain of overflow records is added to each bucket.

– A collision is resolved by placing the new record and its key in an unused overflow location


Basic index concepts

Indexes are data structures used to speed up access to sets of records stored in a database (on disk or in main memory).

– E.g., author catalog in library

An index consists of records, called index entries, of the form

The key of an index is the attribute(s) of the indexed set of tuples used to look up records, e.g. SSN, ISBN.

– The key is a record of one or several key values, which are usually stored directly in the index entry (e.g. SSN + ACCOUNT#).

The value is a tuple that stores the corresponding data values

– The value field is usually a pointer to a data record storing the values

• Two basic kinds of indices:

key value


Types of indexes

Primary Index

– Defined on an ordered data file

– The data file is ordered on one ore several key field(s)

– Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block


• This makes primary indexes very compact

Clustering Index

– Defined on an ordered data file

– The data file is ordered on one or several non-key field(s) unlike the primary index, which requires that the ordering field of the data file has a distinct value for each distinct value of the index.

– The index value for each distinct value of the search key points to the first data block that contains records with that field value.

• For example, think on an index of four character string used to index a files ordered on long strings.


Types of indexes (cont.)

Secondary Index

– A secondary index provides a secondary means of accessing a file for which some primary access already exists.

– It is a non-clustering index since the indexed records are not ordered by the index keys

– Retrieving all records pointed to from a secondary index can be very slow if the table is large.

Unique index

– A unique index contains key thus having a single unique value for each key – A multiple index indexes a non-key position and has a set of values for each

key value.


Clustered indexes

• clustered index • Non-clustered index


More index concepts

• The index is often specified on one attribute of the indexed tuples, but can also be specified on several attributes.

The index is sometimes called an access path to the indexed attribute.

A search over an index yields a scan whose cursor points to file records

Scans over ordered indexes are usually represented data structures representing upper and lower limits of key values in a search along with the cursor

• Indexes can also be characterized as dense or sparse:

– A dense index has an index entry for every search key value (and hence every record) in the data file. A secondary index is usually dense.

– A sparse (or nondense) index, on the other hand, has index entries for only some of the search values. A primary index is usually sparse.


Ordered indexes

Most ordered indexes use the highly scalable B-tree data structure (Bayer, Acta Informatica 1(2), 1972)

B-trees are automatically rebalanced trees with many children for each node (large fan-out)

• Each B-tree node occupies one disk block.

– One disk block at the time is read into main memory by the DBMS – The DBMS maintains a pool of disk blocks in main memory

– When pool is full disk blocks are flushed to disk

In main memory each B-tree node should have a size close to the cache line size used, to avoid memory cache misses.


B-tree indexes

• Each node is kept between half-full and completely full

• An insertion into a node that is not full is quite efficient – Just fill an empty key/value slot in the node

• If a node is full the insertion causes a split into two nodes

• Splitting may propagate to neighboring tree levels

• A deletion is quite efficient if a node does not become less than half full

• If a deletion causes a node to become less than half full, it must be merged with neighboring nodes

• On the average the nodes are 63% full

• B-trees also shown excellent in modern main memories with big differences in speed between data in caches and in the rest of the memory (G. Graefe & P-Å.

Larsson, B-tree Indexes and CPU Caches, ICDE 2001).


B-tree Structures



Nodes in a B-tree are uniform:

( (key, value, subtree) …)

• Each node in a B-tree contains an array of key/value pairs and pointers to subtrees.

• There is no difference between leaf nodes and intermediate nodes.

• Assume there are 108 tuples in the index

• Assume a block size of 8192 and that each B-tree node is on the average 63% full.

• Assume each (key,value,subtree) triple uses 12 bytes.

=> 8192*0.63/12 = 430 = triples per node

=> The average depth of the B-tree will be log430(108)=3.0

=> A key/value pair can be accessed with 3 block reads (disk accesses) on the average


The Nodes of a B+-tree

FIGURE 14.11 The nodes of a B+-tree

(a) Internal node of a B+-tree with q –1 search values.

(b) Leaf node of a B+-tree with q – 1 search values and q – 1 data pointers.





• In a B+-tree, intermediate nodes contain keys + pointers to subtrees (no values) Intermediate nodes: ((key, subtree) …)

• In a B+-tree, only the leaf-level nodes contain pointers to data records:

Leaf nodes: ((key, value) ….).

The leaf nodes are linked to enable fast scans.

• Because there are fewer data pointers in B+-trees, the fan-out (average number of children) becomes larger with B+-trees than with B-trees.

• Assume a block size of 8193 and that each B-tree node is on the average 63% full.

• Assume each (key,value) or (key,suptree) pairs uses 8 bytes.

=> 8192*0.63/8 = 645 = pairs per node


Hash indexes

Hash indexes organizes the keys with their associated value pointers, into a hash table structure.

• Hash indexes are very fast for accessing a value (or a set of values) for a given key.

Hash indexes do not support range searches.

Hash indexes require dynamic hash tables that gracefully grow or schrink without significant delays as the database is updated

• Regular hashing problematic when files grow and shrink dynamically

– Dynamic and graceful scale-up and scale-down of tables needed in DBMSs – Special hashing techniques have been developed to allow in incremental and

dynamic growth and shrinking of the number of file records in hash files.

– The most used one is called LH, Linear Hashing (Litwin, VLDB 1980) – Linear Hashing is also shown to be preferable when storing dynamic hash

tables in main memory (P-Å Larson, Dynamic Hash Tables, CACM 31(4), 1988).


Example of a hash index


Overview of LH

• Dynamic hash algorithm

• A hash file has buckets with some capacity b >> 1

• Not one hash functions but a series of hash functions h


(k) of keys k as the hash file evolves.

• Typically hash by division h


(k) = k mod 2


, i=1,2,3,4,….

• Buckets split through the replacement of h


with h


; i = 0,1,..

as the file grows

• As the file grows and the load (number of keys) of the hash table thereby increases beyond some threshold, buckets are

successively split. For split buckets b/2 keys move towards new buckets.

• Shrinking the files is similarly possible by joining buckets when

the table load decreases below some threshold.


LH File Evolution

35 12 7 15 24

b = 4 i = 0


h0(k)=k mod 20



LH File Evolution

35 12 7 15 24

b = 4 i = 0


h0(k)=k mod 20





LH File Evolution

12 24

b = 4 i = 1 p=0

h1(k)=k mod 21


35 7 15



LH File Evolution

32 58 12 24

b = 4 i = 1 p=0

h1(k)=k mod 21


21 11 35 7 15







LH File Evolution

32 12 24

b = 4 i = 1


h2(k)=k mod 22


21 11 35 7 15





LH File Evolution

32 12 24

b = 4 i = 1 p=1

h2(k)=k mod 22


33 21 11 35 7 15











LH File Evolution

32 12 24

b = 4 i = 1 p=1

h2(k)=k mod 22


33 21




11 35 7 15



LH File Evolution

32 12 24

b = 4 i = 2 p=0

h2(k)=k mod 22


33 21










11 35 7 15





Main-memory LH

• LH shown excellent also for main memory hash tables (P-Å Larson 1988).

• No need for specifying size when hash table allocated

• Dynamic grow and shrink

• No buckets but just overflow chains possible in main memory (i.e. b=1).

– As for B-tree a bucket size close to cache line size is preferred.

• When say 200% full (twice as many keys as size of table) after insert move p forward and add bucket

• When say 50% full after delete move p backwards and


Index evaluation metrics

Access operations supported efficiently. e.g.,

– Records with an explicitly specified value in the attribute (efficient get/put) – Records with an attribute value falling in a specified range of values (range


Access time as the number of records grow

Insertion time as the number of records grow

Deletion time as the number of records grow

Space overhead of representing the index

Scalable dynamic behavior: The index should not make the DBMS behave

unpredicably, such as stopping for reorganization as the number of records grows or shrinks.




Relaterade ämnen :