I work with the Large-Scale Distributed Systems (LSDS) group.
A wide variety of topics interest me. Much of my work
has been in data management and processing, developing both mathematical models for prediction and optimization and systems with practical application. I also gravitate toward
distributed and cloud systems. I have omitted projects from this list where my contributions did not, and are not expected to, lead to publications.
Meta Data Flows (MDFs)
One well-publicized promise of the information age is our expanding ability to perform richer analysis of big data in search of underlying information. To this end, much effort has focused on handling larger amounts of data, in the hope that processing more data yields more information. While this has produced some amazing tools, it ignores one of the main dimensions of in-depth analysis: the variability of the analysis function itself. With Meta Data Flows (MDFs) we set about creating a tool for users to specify a solution space of many ways to analyse a set of data and compare the outputs of those analyses against each other. MDFs automatically explore this solution space and find the user's most desired outcome(s) without the need to change the code or deployment configuration.
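The idea of a solution space of analyses can be sketched as the cross product of interchangeable choices at each pipeline step. This is a minimal illustration only; the names and structure here are assumptions, not MDFs' actual interface.

```python
from itertools import product

# Toy dataset and two pipeline steps, each with interchangeable variants.
data = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]

cleaners = {
    "identity": lambda xs: xs,
    "drop_outliers": lambda xs: [x for x in xs if x < 40],
}
summaries = {
    "mean": lambda xs: sum(xs) / len(xs),
    "median": lambda xs: sorted(xs)[len(xs) // 2],
}

# Explore every combination of variants and collect the outputs,
# so the analyses can be compared against each other.
results = {
    (c, s): summaries[s](cleaners[c](data))
    for c, s in product(cleaners, summaries)
}

for variant, output in sorted(results.items()):
    print(variant, output)
```

The point of the exercise: varying the analysis function, not just the data volume, produces a space of candidate outcomes to search over.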
SquirrelJoin (sub-project of NaaS)
Distributed joins partition data across the network in order to compute results on subsets of the data in parallel. Unfortunately, records are not always stored on the node where they are processed, so the result is a large network load. If the network is shared with other applications, the resources available to the join become imbalanced. We attempt to detect this skew at runtime and reroute traffic to machines on underutilized portions of the network.
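The partitioning and rerouting idea can be sketched as follows. This is a simplified illustration, not SquirrelJoin's implementation: the real system measures network conditions at runtime, whereas here a "congested" worker is simply assumed.

```python
# Hash-partitioned distributed join with a naive rerouting step.
# Whole key groups are diverted together, so join correctness holds.

def partition(records, n_workers, reroute=None):
    """Assign each record to a worker by hashing its join key."""
    reroute = reroute or {}
    buckets = {w: [] for w in range(n_workers)}
    for key, value in records:
        worker = hash(key) % n_workers
        # Traffic bound for an overloaded worker goes elsewhere instead.
        worker = reroute.get(worker, worker)
        buckets[worker].append((key, value))
    return buckets

left = [("a", 1), ("b", 2), ("c", 3)]
right = [("a", 10), ("b", 20)]

# Suppose worker 0 sits on a congested link: divert its load to worker 2.
buckets_l = partition(left, 3, reroute={0: 2})
buckets_r = partition(right, 3, reroute={0: 2})

# Each worker joins only its own partitions; matching keys always land
# on the same worker, rerouted or not.
for w in range(3):
    lookup = dict(buckets_r[w])
    for key, value in buckets_l[w]:
        if key in lookup:
            print(key, value, lookup[key])
```

Because both sides of the join apply the same reroute map, co-location of matching keys is preserved even as traffic moves off the congested path.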
Optimal Aggregation Overlays (PhD Thesis)
Much of my PhD studies involved optimizing aggregation
overlays for compute-aggregate problems. Many problems in computer
science can be split into two phases: computation and aggregation.
Computation runs on a subset of a large dataset on a local node in a
cluster and outputs an intermediate result. Intermediate results are
then aggregated in a distributed manner. As we rigorously show through
mathematical modeling and analysis (see our INFOCOM'15 paper), a very
simple property of the aggregation function can guide the construction
of a theoretically optimal overlay.
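The two phases can be made concrete with a classic example, word count: local nodes compute partial counts, and the intermediate results are merged up an aggregation tree. The binary tree below is an arbitrary choice for illustration; the shape of this overlay is precisely what the analysis above optimizes.

```python
from collections import Counter

def compute(chunk):
    """Computation phase: runs on a local subset of the data."""
    return Counter(chunk.split())

def aggregate(a, b):
    """Aggregation phase: merges two intermediate results."""
    return a + b

chunks = ["a b a", "b c", "a c c", "b b"]
partials = [compute(c) for c in chunks]

# Merge pairwise until one result remains (a binary aggregation tree).
while len(partials) > 1:
    partials = [
        aggregate(partials[i], partials[i + 1])
        if i + 1 < len(partials) else partials[i]
        for i in range(0, len(partials), 2)
    ]

print(partials[0])  # global counts: a=3, b=4, c=3
```

Whether a flat star, a binary tree, or something in between is best depends on properties of `aggregate`, such as how much it shrinks or grows its inputs.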
Creating theoretically optimal overlays isn't sufficient to truly show
the relevance of our work. In a real system, stripped of some of our
simplifying assumptions, there is always a chance that things will
behave differently. That's why we created a system based around the
compute-aggregate model to implement and test our suppositions. In
addition to generated data used to microbenchmark the system, we run
more practical applications on real-world data (preliminary results in
HotCloud'14, and an extended paper in submission).
Top-k Matching (PhD Thesis)
For several applications, most notably serving advertisements, it is
useful to find the small subset of subscribing entities whose interests
best match an incoming event. In order to do this we first have to
determine the parameters for a scoring mechanism. We expanded the
expressiveness of the preference definitions and scoring mechanisms over
prior art. We then created a matching system, based on existing
data structures and a straightforward algorithm, which efficiently finds
the best-matching subscribing entities for each incoming event in a
dynamic system. As shown in our Middleware'14 paper, our approach is
experimentally competitive with the prior art when the increased
expressiveness is not used, and performs much better when the prior
approaches are modified to include it.
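The core top-k operation can be sketched with a bounded min-heap: each subscription scores the incoming event, and the heap retains the k best matches seen so far. The scoring function here (a weighted attribute overlap) is purely illustrative and far simpler than the paper's preference model.

```python
import heapq

# Hypothetical subscriptions: each maps interest attributes to weights.
subscriptions = {
    "sub1": {"sports": 1.0, "news": 0.2},
    "sub2": {"news": 0.9},
    "sub3": {"sports": 0.5, "finance": 0.8},
}

def score(prefs, event):
    """Score = sum of preference weights over the event's attributes."""
    return sum(prefs.get(attr, 0.0) for attr in event)

def top_k(event, k):
    heap = []  # min-heap of (score, subscriber), size <= k
    for sub, prefs in subscriptions.items():
        s = score(prefs, event)
        if len(heap) < k:
            heapq.heappush(heap, (s, sub))
        elif s > heap[0][0]:
            # Evict the current worst match in O(log k).
            heapq.heapreplace(heap, (s, sub))
    return sorted(heap, reverse=True)

print(top_k({"sports", "news"}, k=2))
```

The bounded heap keeps per-event work at O(n log k) over n subscriptions; the actual system avoids scoring every subscription by using index structures.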
TCAM Compression
TCAMs are very useful in software-defined networking, among other
applications. They allow very quick matching on rules with multiple
constraints. The problem is that they are very expensive, both as a
hardware component and in terms of energy consumption. In order to use
TCAMs to their fullest potential, we have to find a way to distill the
minimal information needed for matching. That is exactly what we do,
starting with our HotSDN'13 paper, continuing with our SIGCOMM'14 paper,
and further continuing with a journal paper appearing in ToN.
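For readers unfamiliar with TCAMs, the operation they perform in hardware is ternary matching: each rule is a (value, mask) pair, where set mask bits are "care" bits and cleared bits are wildcards. The sketch below checks rules sequentially in priority order; a TCAM performs all comparisons in parallel, which is exactly why every stored rule costs hardware and energy, and why minimizing the rule set matters.

```python
# Illustrative rule table with 8-bit keys (real tables match on
# packet header fields). Highest-priority rule first.
rules = [
    # (value, mask, action)
    (0b10100000, 0b11110000, "drop"),   # matches 1010****
    (0b10000000, 0b10000000, "fwd1"),   # matches 1*******
    (0b00000000, 0b00000000, "fwd2"),   # matches everything
]

def tcam_lookup(key):
    """Return the action of the first rule whose care bits match."""
    for value, mask, action in rules:
        if key & mask == value & mask:
            return action
    return None

print(tcam_lookup(0b10101111))  # -> drop
print(tcam_lookup(0b10111111))  # -> fwd1
print(tcam_lookup(0b00001111))  # -> fwd2
```

Distilling the rule set means finding a smaller table of (value, mask) entries that classifies every key identically.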
Statistical Measures for Compression Savings Prediction
Compression algorithms attempt to decrease the amount of space a dataset
requires by exploiting the unevenness in the amount of information each
bit imparts. When sequences of bits (bytes) are produced, some are often
used more than others, which leads to this unevenness of information.
Based on this premise, it seems like it should be possible to take a few
statistical descriptors of a dataset and use them to predict the
effectiveness of a compression algorithm. In my undergraduate thesis I
explore a small handful of potential predictors and their accuracy in
predicting the savings provided by a small handful of compression
algorithms.
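As an illustration of the kind of predictor involved (not necessarily one from the thesis): the Shannon entropy of the byte distribution bounds the best achievable bits per byte for a memoryless coder, so `1 - entropy/8` gives a rough estimate of the fraction of space a compressor could save.

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the empirical byte distribution."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def predicted_savings(data: bytes) -> float:
    """Predicted fraction of space saved, in [0, 1]."""
    return 1.0 - entropy_bits_per_byte(data) / 8.0

skewed = b"aaaaaaab" * 1000       # heavily skewed byte usage
uniform = bytes(range(256)) * 4   # every byte value equally common

print(round(predicted_savings(skewed), 2))   # ~0.93: very compressible
print(round(predicted_savings(uniform), 2))  # 0.0: incompressible
```

This descriptor ignores ordering, so it misses repetition that dictionary coders like LZ77 exploit; that gap between simple statistics and real compressor behavior is exactly what makes the prediction question interesting.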