Project 4: Mystery Data Structures

Due: Thursday, 2 Aug 2012, at 23:59:59 PDT

Overview

In contrast to P1-P3, in which you implemented a data structure, in P4 you will be examining data structures (using computational methods) whose types you do not know. You must perform experiments, in the form of measurements of the time cost of different operations performed on the mystery data structures, to deduce the data structures' types.

Each student has been assigned a set of 5 data structures whose types must be determined. To "fetch" each of these data structures, you should call MysteryDataStructure.getMysteryDataStructure(cs12UserID, i, new Integer(0)). The first parameter must be your CSE 12 user ID -- this is crucial because each student receives a different set of data structures to analyze! The second parameter i specifies which of the mystery data structures (0-4) to "fetch". The third parameter can be any object of type T Comparable<? super T> -- reasonable examples include Integer and String. The purpose of this third parameter is to let you specify the type of element you want your mystery data structure to store. Integer should work fine and there's no compelling reason (that I see) to change it, but you have that flexibility -- just instantiate a new Whatever() and pass it in as the third parameter to getMysteryDataStructure. The MysteryDataStructure.getMysteryDataStructure returns an instance of (interface) type Collection12. The DeduceTheMysteryDataStructure.java file should get you started. Also see the Javadoc for more information on the mystery data structure and related classes.

Once you've "fetched" each of the 5 data structures, you will need to perform some experiments on it to deduce its type. There are four possible data structures that each mystery data structure might be, and each has associated time cost properties:

Doubly-linked list. O(1) time cost for add, O(n) for remove and contains in worst-case.
Heap. O(log n) time cost for add, O(n) for remove and contains in worst-case.
Binary search tree (self-balancing). O(log n) time cost for add, remove, and contains in worst-case.
Hash table. O(1) time cost for add and contains in average-case.

Your 5 assigned mystery data structures will be chosen from the pool above. There is no guarantee that you will receive one of each -- each of the 5 mystery data structures is chosen independently and uniformly at random from the above set of 4. In addition, you may receive multiple data structures of the same type that have slightly different performance characteristics -- their asymptotic complexities would have to be the same, of course, but the constant factors associated with the time costs could vary substantially.

You may (and should) rely on the time cost properties to deduce the identity of each mystery data structure. In particular, your measurement code should measure the time cost of various operations (e.g., add, remove, contains) for various numbers of elements N stored in the container. Of course, to populate the container with N elements, you'll need to call the add method on the mystery data structure. To measure the time of various operations, you should use the ArtificialClock class, in particular the getNumTicks method, which returns the "current time" of the simulated clock. An example is shown in the boilerplate DeduceTheMysteryDataStructure.java file.

Compiling and running

You will need to have the P4Stuff.jar file in the current working directory; this file is necessary both for compilation and execution: To compile: javac -cp .:P4Stuff.jar DeduceTheMysteryDataStructure.java. To run: java -cp .:P4Stuff.jar DeduceTheMysteryDataStructure.

Requirements

Note that you may not use Java reflection or related means to "hack" your way into the identity of each data structure. You must run experiments based on the time costs of various Collection12 operations. Submission requirements:

You need to create a single PDF file Analysis.pdf containing:
- The type (linked list, heap, binary search tree, or hash table) of each of your mystery data structures, indexed 0-4.
- Empirical evidence that led you to infer the type you specified. This evidence should take the form of graphs, showing the time cost of various Collection12 methods as a function of N, the number of data stored in the container. The purpose of each graph is to reveal the asymptotic time cost of the various operations. Given the inferred asymptotic complexities of various Collection12 operations, you can deduce the identity of mystery data structure. In addition to the graphs, you should also give a short written justification for how you decided the identity of each structure. For example, you might write: "Graph 1 shows the time cost of the add method for Mystery Data Structure 0. The slope of the curve grows smaller as N grows larger, and the curve is clearly O(log n). Based on that plus other evidence (etc. etc.) we conclude that the mystery data structure must be a binary search tree."
In addition, you need to submit your DeduceTheMysteryDataStructure.java file containing your code that runs the experiments (time measurements on each mystery data structure).

Sample graphs

Bad graph that was generated without averaging over many trials. The data are so noisy that it's hard to infer much of a trend.
Good graph that was generated with averaging over many trials. The graph (generated from the same mystery data structure as above!) is clearly O(log n).

Both graphs plot the time cost of the contains(o) of one of my own mystery data structures.

How big should n be?

For this assignment, n=10000 should be completely sufficient. In my own solution, I use the following different values for n:
final int[] Ns = { 1, 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000 };
If you try to add a huge amount of data to your mystery data structure, the Java virtual machine's (JVM) memory may be exhausted. In this case, if you try to call add(o) to any of the P4 data structures, the JVM will simply terminate due to lack of memory.

Hints

It will be important to average experimental results to reduce the noise in your graphs. The simulated running times (as measured by the ArtificialClock) will vary somewhat from trial to trial.
To create graphs showing the time costs of various operations, your DeduceTheMysteryDataStructure.java experimental code should output data that you can then plot in whatever graph program you prefer (e.g., gnuplot, Matlab, Excel, whatever).

Extra Credit

For extra credit you can implement the AlgorithmClassifier interface -- see the Javadoc for more information. The job of the classifier is to take as input a mystery data structure and then output what kind of data structure it is automatically. Your classifier will be tested on a large number of data structures, and guessing will be penalized during evaluation. You should implement your classifier in a class called AlgorithmClassifierImpl.

Grading

The maximum score on P4 is 9 points. The extra credit is worth 5 points maximum.

Submission

Submit your work using the bundleP4 script in the directory containing your DeduceTheMysteryDataStructure.java and Analysis.pdf files. If you choose to do the extra credit, then you should also submit your AlgorithmClassifierImpl.java file (otherwise, don't worry about this file). Make sure you are logged on to ieng6.ucsd.edu using your CSE 12-specific account. As always, make sure to submit before the deadline -- the script will show you (or anyone else) no mercy!