Research Interests

I work with the Large-Scale Distributed Systems (LSDS) group.

A wide variety of topics interest me. Much of my work has been in data management and processing, developing both mathematical models for prediction and optimization and practical systems. I also gravitate toward distributed and cloud systems. I have omitted projects from this list where my contributions did not, and are not expected to, lead to publications.

Current Projects

MDF: Meta-Dataflows

One well-publicized promise of the information age is our expanding ability to develop richer analyses of big data in search of underlying information. To this end, much effort has focused on handling larger amounts of data, in the hope that processing more data yields more information. While this has produced some amazing tools, it ignores one of the main dimensions of in-depth analysis: the variability of the analysis function itself. With meta-dataflows (MDFs) we set about creating a tool that lets users specify a solution space of many different ways to analyse a dataset and compare the outputs of those analyses against each other. MDFs automatically explore this solution space and find the user's most desired outcome(s) without requiring changes to the code or deployment configuration.
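The core idea can be illustrated with a minimal sketch. This is not the MDF system itself, just a hypothetical brute-force exploration of a space of analysis variants, where `stages`, `score`, and the toy functions are all names invented for this example:

```python
# Hypothetical sketch: enumerate every combination of alternative analysis
# steps and keep the result the user's score function rates highest.
from itertools import product

def explore(data, stages, score):
    """stages: a list of lists, one list of alternative functions per step;
    score: a user-defined quality metric on a final result."""
    best = None
    for variant in product(*stages):      # one function choice per stage
        result = data
        for stage in variant:
            result = stage(result)
        if best is None or score(result) > score(best):
            best = result
    return best

# Toy usage: two cleaning options x two models over a list of numbers.
clean = [lambda xs: [x for x in xs if x >= 0], lambda xs: [abs(x) for x in xs]]
model = [lambda xs: sum(xs), lambda xs: max(xs, default=0)]
print(explore([3, -1, 4], [clean, model], score=lambda r: r))  # -> 8
```

The real system explores the space automatically and efficiently rather than exhaustively; the sketch only shows the shape of the problem: a space of dataflows, one score, one preferred outcome.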

SquirrelJoin (sub-project of NaaS)

Distributed joins partition data across the network in order to compute results on subsets of the data in parallel. Unfortunately, records are not always stored on the node where they are processed, so the result is a large network load. If the network is shared with other applications, the resources available to the join can become imbalanced across machines. We attempt to detect this skew at runtime and reroute traffic to machines on underutilized portions of the network.
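As a rough illustration of the routing side of this, here is a hypothetical sketch (not SquirrelJoin's actual mechanism): keys are hash-partitioned by default, and a scheduler can install per-key overrides to steer traffic away from a congested worker.

```python
# Hypothetical sketch: hash-partitioned key routing with reroute overrides.
import hashlib

def route(key, workers, overrides):
    """Default: hash-partition keys across workers; 'overrides' lets a
    scheduler reroute specific keys to less contended machines."""
    if key in overrides:
        return overrides[key]
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return workers[h % len(workers)]

workers = ["w0", "w1", "w2"]
default = route("user42", workers, {})                   # pure hash partitioning
rerouted = route("user42", workers, {"user42": "w1"})    # scheduler override
print(default, rerouted)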

Previous Projects

Optimal Aggregation Overlays (PhD Thesis)

Much of my PhD research involved optimizing aggregation overlays for compute-aggregate problems. Many problems in computer science can be split into two phases: computation and aggregation. Computation runs on a subset of a large dataset on a local node in a cluster and outputs an intermediate result. Intermediate results are then aggregated in a distributed manner. As we rigorously show through mathematical modeling and analysis (see our paper in INFOCOM'15), a very simple property of the aggregation function can guide the construction of a theoretically optimal overlay.
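To make the compute-aggregate model concrete, here is a minimal sketch (my own illustration, not the paper's construction), where the overlay is reduced to a single tunable parameter, the tree fan-out:

```python
# Hypothetical sketch of the aggregation phase: intermediate results are
# combined level by level in a tree; the overlay shape (here just the
# fan-out) is the dimension being optimized.
def aggregate_tree(values, combine, fanout):
    level = list(values)
    while len(level) > 1:
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# Toy usage: partial word counts from 8 compute nodes, aggregated by sum.
partials = [5, 3, 7, 1, 4, 6, 2, 8]
print(aggregate_tree(partials, combine=sum, fanout=2))  # -> 36
```

The answer is the same for any fan-out; what differs is the cost of the overlay, and the property of `combine` (e.g. how it changes data size) is what determines which shape is optimal.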

Creating theoretically optimal overlays isn't sufficient to truly show the relevance of our work. In practice, stripped of some of our simplifying assumptions, there is always a chance that things will behave differently. That's why we created a system based on the compute-aggregate model to implement and test our suppositions. In addition to generated data used to microbenchmark the system, we run more practical applications on real-world data (preliminary results in HotCloud'14, and an extended paper in submission).

Top-k Matching (PhD Thesis)

For several applications, most notably serving advertisements, it is useful to find a small subset of subscribing entities whose interests best match an event. In order to do this, we first have to determine the parameters of a scoring mechanism. We expanded the expressiveness of the preference definitions and scoring mechanisms over prior art. We then created a matching system, based on existing data structures and a straightforward algorithm, which efficiently finds the k best-matching subscribing entities for each incoming event in a dynamic system. As shown in our Middleware'14 paper, our approach is experimentally competitive with the prior art (which does not support the increased expressiveness), and performs much better when the prior approaches are modified to include the new expressiveness.
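The problem statement fits in a few lines. This naive sketch (not our algorithm, which avoids scoring every subscription) uses invented names and a toy weighted-overlap score just to show what "k best matches for an event" means:

```python
# Hypothetical sketch: score every subscription against an event, keep the
# k highest.  The real system uses index structures to avoid the full scan.
import heapq

def top_k(event, subscriptions, score, k):
    return heapq.nlargest(k, subscriptions, key=lambda s: score(event, s))

# Toy usage: subscriptions weight attributes; score is the weighted overlap
# between a subscription's interests and the event's attributes.
subs = [{"sports": 2.0}, {"tech": 3.0, "sports": 1.0}, {"news": 1.0}]
event = {"sports", "tech"}
score = lambda e, s: sum(w for a, w in s.items() if a in e)
print(top_k(event, subs, score, k=2))
```

Here the second subscription scores 4.0 and the first 2.0, so those two are returned, in that order.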

TCAM Optimization

TCAMs are very useful in software-defined networking, among other applications. They allow very quick matching on rules with multiple constraints. The problem is that they are very expensive, both as a hardware component and in terms of energy consumption. In order to use TCAMs to their fullest potential, we have to find a way to distill the minimal information needed for matching. That is exactly what we do, starting with our HotSDN'13 paper, continuing with our SIGCOMM'14 paper, and further continuing through a journal paper appearing in ToN in March 2015.
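For readers unfamiliar with TCAMs, a software sketch of ternary matching (my illustration, not our optimization): each rule is a value/mask pair, a field matches if it agrees on every unmasked bit, and the first matching rule wins. Every rule occupies a TCAM entry, which is why minimizing the rule set matters.

```python
# Hypothetical sketch of ternary (TCAM-style) matching: a field matches a
# rule if it equals the rule's value on all bits selected by the mask.
def tcam_lookup(field, rules):
    """rules: ordered list of (value, mask, action); first match wins."""
    for value, mask, action in rules:
        if field & mask == value & mask:
            return action
    return "default"

rules = [
    (0b1010_0000, 0b1111_0000, "fwd_port_1"),   # matches 1010****
    (0b1000_0000, 0b1100_0000, "fwd_port_2"),   # matches 10******
]
print(tcam_lookup(0b1010_0110, rules))   # -> fwd_port_1 (first match)
print(tcam_lookup(0b1011_0000, rules))   # -> fwd_port_2
```

Notice the second rule already covers the first; collapsing such overlaps without changing any lookup result is the flavor of minimization the work pursues.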

Statistical Measures for Compression Savings Prediction (Undergraduate Thesis)

Compression algorithms attempt to decrease the amount of space a dataset requires by exploiting the unevenness in the amount of information each bit imparts. When sequences of bits (bytes) are produced, some are often used more than others, which leads to this unevenness of information. Based on this premise, it seems it should be possible to compute a few statistical descriptors of a dataset and use them to predict the effectiveness of a compression algorithm. In my undergraduate thesis I explore a small handful of potential predictors and their accuracy in predicting the savings provided by a small handful of compression algorithms.
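One natural descriptor of this kind is the Shannon entropy of the byte distribution (shown here as a generic illustration, not necessarily one of the thesis's predictors): skewed byte usage gives low entropy, which suggests better compression.

```python
# Illustrative sketch: byte-level Shannon entropy as a candidate predictor
# of compressibility, compared against an actual compressor (zlib).
import math
import zlib
from collections import Counter

def byte_entropy(data):
    """Bits of information per byte, in [0, 8]; lower suggests more savings."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

uneven = b"aaaaabbbcc" * 100      # skewed byte usage -> low entropy
even = bytes(range(256)) * 4      # uniform byte usage -> maximal entropy
for name, d in [("uneven", uneven), ("even", even)]:
    ratio = len(zlib.compress(d)) / len(d)
    print(name, round(byte_entropy(d), 2), round(ratio, 2))
```

The `even` input also shows a limitation of such predictors: its byte distribution is maximally flat, yet zlib still compresses it via its repeating structure, which entropy alone cannot see.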