Metric Clustering
Claire Kenyon, Ecole Polytechnique and Institut Universitaire de France
Given n data items in a general metric space, how can we efficiently partition them into k clusters of "similar" items? There are many models for this ubiquitous problem, which arises in application areas such as data mining, image recognition, medical imaging, web analysis, etc. One measure of the quality of a k-clustering is the sum of all pairwise distances inside the clusters, which must be minimized. We discuss techniques and algorithms, first for the complementary problem, which can be seen as a metric analog of Max-Cut in graphs, then for 2-clustering, and finally sketch extensions to variants with other objective functions or with cardinality constraints. The algorithms are based on random sampling.
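As a small illustration of the objective, here is a minimal sketch (the point set, metric and function name are mine, not from the talk) that computes the sum of intra-cluster pairwise distances for a given partition:

from itertools import combinations

def within_cluster_cost(points, clusters, dist):
    """Sum of all pairwise distances inside the clusters -- the k-clustering
    objective to be minimised.  `clusters` is a list of lists of indices
    into `points`; `dist` is any metric on the points."""
    return sum(dist(points[i], points[j])
               for cluster in clusters
               for i, j in combinations(cluster, 2))

# Toy example: five points on a line with the usual metric.
pts = [0.0, 1.0, 2.0, 10.0, 11.0]
d = lambda a, b: abs(a - b)
print(within_cluster_cost(pts, [[0, 1, 2], [3, 4]], d))  # -> 5.0 (= 4.0 + 1.0)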
|
Perfect Phylogeny and Haplotype Assignment
Eran Halperin, Princeton University
Each person's genome contains two copies of each chromosome, one inherited from the father and the other from the mother. A person's genotype specifies the pair of bases at each site, but does not specify which base occurs on which chromosome. The sequence of each chromosome separately is called a haplotype. The determination of the haplotypes within a population is essential for understanding genetic variation and the inheritance of complex diseases. The haplotype mapping project, a successor to the human genome project, seeks to determine the common haplotypes in the human population.
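To make the genotype/haplotype ambiguity concrete, here is a toy sketch (the 0/1/2 site encoding and the function name are my own illustration): a genotype with k heterozygous sites is consistent with 2^(k-1) unordered haplotype pairs, which is what makes reconstruction non-trivial.

from itertools import product

def haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.
    Encoding (assumed here): per site, 0 = homozygous 0, 1 = homozygous 1,
    2 = heterozygous.  Each heterozygous site can be phased two ways."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for choice in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, choice):
            h1[site], h2[site] = bit, 1 - bit
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs

# A genotype with 3 heterozygous sites has 2^(3-1) = 4 consistent pairs.
print(len(haplotype_pairs([0, 2, 1, 2, 2])))  # -> 4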
In this talk I will focus on the theoretical aspects of the reconstruction of haplotypes from genotypes or from 'noisy' haplotypes. The problems arising in this area are of great practical importance, but at the same time, they motivate interesting theoretical problems. I will describe relations of these problems to phylogeny problems, to the set cover problem, and to others.
The talk will be self-contained - no biological background is needed.
Joint work with Eleazar Eskin and Richard M. Karp.
|
Clustering with Qualitative Information
Tony Wirth, Princeton University
We consider the problem of clustering a collection of elements based on
pairwise judgments of similarity and dissimilarity. Bansal, Blum and Chawla cast
the problem thus: given a graph G whose edges are labeled "+"
(similar) or "-" (dissimilar), partition the vertices into clusters so
that the number of pairs correctly (resp. incorrectly) classified with respect to
the input labeling is maximized (resp. minimized). Complete graphs, where the
classifier labels every edge, and general graphs, where some edges are not
labeled, are both worth studying. We answer several questions left open in the
paper of Bansal et al.
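For concreteness, here is a small sketch of the quantity being minimized (the dictionary representation and the toy instance are mine): count the labeled pairs that a candidate clustering classifies incorrectly.

def disagreements(labels, cluster_of):
    """Number of labeled pairs classified incorrectly by a clustering.
    `labels` maps a pair (u, v) to '+' (similar) or '-' (dissimilar);
    unlabeled pairs simply do not appear.  `cluster_of` maps each vertex
    to its cluster id.  A '+' pair is wrong if its endpoints are separated,
    a '-' pair is wrong if they are placed together."""
    bad = 0
    for (u, v), sign in labels.items():
        same = cluster_of[u] == cluster_of[v]
        if (sign == '+' and not same) or (sign == '-' and same):
            bad += 1
    return bad

# Toy instance: two '+' edges and one '-' edge force at least one mistake.
labels = {('a', 'b'): '+', ('b', 'c'): '+', ('a', 'c'): '-'}
print(disagreements(labels, {'a': 0, 'b': 0, 'c': 0}))  # -> 1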
In this presentation I will show a factor 4 approximation algorithm for
minimization on complete graphs, and a factor O(log n) approximation for general
graphs. For the maximization version, a PTAS (polynomial time approximation
scheme) for complete graphs is shown by Bansal et al.; I will demonstrate a
factor 0.7664 approximation for general graphs, using semi-definite programming.
If time permits, I will summarize the reductions used in showing that some of
these problems are APX-hard.
This is joint work with Moses Charikar and Venkatesan Guruswami and will
appear in FOCS 2003.
|
Frugality in Path Auctions
Edith Elkind, Princeton
We consider the problem of picking (buying) an inexpensive s-t
path in a graph where edges are owned by independent (selfish)
agents, and the cost of an edge is known to its owner only.
We focus on minimizing the buyer's total payments.
First, we show that any mechanism with (weakly) dominant strategies for the agents (or, equivalently, any truthful mechanism) can force the buyer to make very large payments. Namely, for every such mechanism, the buyer can be forced to pay c(P) + n/2*(c(Q)-c(P)), where c(P) is the cost of the shortest path, c(Q) is the cost of the second-shortest path, and n is the number of edges in P. This extends the previous work of Archer and Tardos, who showed a similar lower bound for a subclass of truthful mechanisms called min-function mechanisms. Our lower bounds have no such limitations on the mechanism.
Motivated by this lower bound, we study a wider class of mechanisms for this problem, namely ones that admit Nash equilibrium strategies. In this class, we identify the optimal mechanism with regard to total payment. We then demonstrate a separation in terms of average overpayments between the classical VCG mechanism and the optimal mechanism, showing that under various natural distributions of edge costs, the optimal mechanism pays at most a logarithmic factor more than the actual cost, whereas VCG pays $\sqrt{n}$ times the actual cost. On the other hand, we also show that the optimal mechanism does incur at least a constant factor overpayment in natural distributions of edge costs. Since our mechanism is optimal, this gives a lower bound on all mechanisms with Nash equilibria.
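As a point of reference for the overpayment discussion, here is a minimal sketch of the standard VCG payment rule for shortest-path (edge) auctions: each winning edge is paid its cost plus the increase in the s-t distance caused by its removal. The graph, costs and helper names below are illustrative, not taken from the talk.

import heapq

def shortest_path(edges, s, t):
    """Dijkstra over an undirected graph given as a dict {(u, v): cost}.
    Returns (distance, list of edge keys on a shortest s-t path)."""
    adj = {}
    for (u, v), c in edges.items():
        adj.setdefault(u, []).append((v, c, (u, v)))
        adj.setdefault(v, []).append((u, c, (u, v)))
    dist, prev, heap = {s: 0.0}, {}, [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue
        for v, c, e in adj.get(u, []):
            if d + c < dist.get(v, float('inf')):
                dist[v], prev[v] = d + c, (u, e)
                heapq.heappush(heap, (d + c, v))
    if t not in dist:
        return float('inf'), None
    path, node = [], t
    while node != s:
        node, e = prev[node]
        path.append(e)
    return dist[t], path

def vcg_payments(edges, s, t):
    """VCG pays each edge on the winning (cheapest) path its own cost plus
    the increase in the s-t distance when that edge is removed."""
    base, path = shortest_path(edges, s, t)
    payments = {}
    for e in path:
        alt, _ = shortest_path({f: c for f, c in edges.items() if f != e}, s, t)
        payments[e] = edges[e] + (alt - base)
    return payments

# Two parallel routes of cost 1+1 and 3: each winning edge is paid 2, so VCG
# pays 4 in total for a path whose true cost is 2.
edges = {('s', 'm'): 1, ('m', 't'): 1, ('s', 't'): 3}
print(vcg_payments(edges, 's', 't'))  # -> {('m', 't'): 2, ('s', 'm'): 2}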
This is joint work with Amit Sahai and Ken Steiglitz.
|
The Ising model on trees: Boundary conditions and mixing time
Dror Weitz, UC Berkeley
We give the first comprehensive analysis of the effect of boundary conditions on the mixing time of the Glauber dynamics for the Ising model. Specifically, we show that the mixing time on an $n$-vertex regular tree with plus-boundary remains $O(n\log n)$ at all temperatures (in contrast to the free boundary case, where the mixing time is not bounded by any fixed polynomial at low temperatures). We also show that this bound continues to hold in the presence of an arbitrary external field. Our results are actually stronger, and provide tight bounds on the log-Sobolev constant and the spectral gap of the dynamics. In addition, our methods yield simpler proofs and stronger results for the mixing time in the regime where it is insensitive to the boundary conditions. Our techniques also apply to a much wider class of models, including those with hard-core constraints like the antiferromagnetic Potts model at zero temperature (proper colorings) and the hard-core lattice gas (independent sets).
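For readers unfamiliar with the dynamics, the following is a minimal simulation sketch of heat-bath Glauber dynamics for the Ising model on a complete binary tree with a plus boundary (leaves pinned to +1); the parameters are arbitrary and this is only an illustration, not the analysis from the talk.

import math, random

def glauber_ising_tree(depth=6, beta=0.8, field=0.0, steps=20000, seed=0):
    """Heat-bath Glauber dynamics for the Ising model on a complete binary
    tree with the leaves fixed to +1 (a 'plus' boundary condition)."""
    rng = random.Random(seed)
    n = 2 ** (depth + 1) - 1            # nodes 0..n-1; children of i: 2i+1, 2i+2
    first_leaf = 2 ** depth - 1
    spin = [rng.choice((-1, 1)) for _ in range(n)]
    for v in range(first_leaf, n):      # plus boundary: leaves pinned to +1
        spin[v] = 1
    def neighbours(v):
        nbrs = [(v - 1) // 2] if v > 0 else []
        return nbrs + [c for c in (2 * v + 1, 2 * v + 2) if c < n]
    for _ in range(steps):
        v = rng.randrange(first_leaf)   # only interior vertices are updated
        local = sum(spin[u] for u in neighbours(v)) + field
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * local))
        spin[v] = 1 if rng.random() < p_plus else -1
    return sum(spin) / n                # magnetisation; biased towards +1

print(glauber_ising_tree())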
In the talk, I will start by reviewing the Ising model on trees, the notion of boundary conditions, and the Glauber dynamics. I will then describe the main ideas allowing us to perform an analysis that is specific to a boundary condition, and present a simple and very useful criterion for $O(n\log n)$ mixing time on a regular tree. I will then illustrate the usefulness of this criterion in the context of the Ising model and a plus-boundary condition. Time allowing, I will use the same criterion in the context of other models, specifically the independent-sets model.
Joint work with Fabio Martinelli and Alistair Sinclair.
|
Approximation Algorithms for Asymmetric TSP by Decomposing Directed Regular Multigraphs
Maxim Sviridenko, IBM
A directed multigraph is said to be d-regular if the indegree and outdegree of every vertex are exactly d. By Hall's theorem, one can represent such a multigraph as a combination of at most n^2 cycle covers, each taken with an appropriate multiplicity. We prove that if the d-regular multigraph does not contain more than d/2 copies of any 2-cycle, then we can find a similar decomposition into O(n^2) pairs of cycle covers where each 2-cycle occurs in at most one component of each pair.
Our proof is constructive and gives a polynomial algorithm to find such a decomposition. Since our applications only need one such pair of cycle covers whose weight is at least the average weight of all pairs, we also give a simpler algorithm to extract a single such pair.
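The single-cycle-cover primitive that underlies such arguments reduces to an assignment problem; below is a rough sketch of that reduction (the toy weights and names are mine, and this is only the primitive, not the decomposition algorithm from the talk).

import numpy as np
from scipy.optimize import linear_sum_assignment

def max_weight_cycle_cover(w):
    """Maximum-weight cycle cover of a complete directed graph.
    A cycle cover assigns each vertex a distinct successor, i.e. it is a
    permutation with no fixed points, so it reduces to an assignment problem."""
    cost = -np.asarray(w, dtype=float)
    np.fill_diagonal(cost, 1e9)               # large penalty forbids self-loops
    rows, cols = linear_sum_assignment(cost)
    succ = dict(zip(rows.tolist(), cols.tolist()))
    return succ, sum(w[i][succ[i]] for i in range(len(w)))

# Toy 4-vertex instance; the optimum consists of the two 2-cycles {0,1}, {2,3}.
w = [[0, 5, 1, 1],
     [5, 0, 1, 1],
     [1, 1, 0, 5],
     [1, 1, 5, 0]]
print(max_weight_cycle_cover(w))  # -> ({0: 1, 1: 0, 2: 3, 3: 2}, 20)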
This combinatorial theorem then comes in handy in rounding a fractional solution of an LP relaxation of the maximum and minimum TSP problems. For maximum TSP, we obtain a tour whose weight is at least 2/3 of the weight of the longest tour, improving a previous 5/8 approximation. For minimum TSP, we obtain a tour whose weight is at most 0.842 log_2 n times the optimal, improving a previous 0.999 log_2 n approximation. Utilizing a reduction from maximum TSP to the shortest superstring problem, we obtain a 2.5-approximation algorithm for the latter problem which is again much simpler than the previous one. Other applications of the rounding procedure are approximation algorithms for maximum 3-cycle cover (factor 2/3, previously 3/5) and maximum asymmetric TSP with triangle inequality (factor 10/13, previously 3/4).
|
Tight Bounds for Approximate Nearest Neighbour Search
Amit Chakrabarti, Dartmouth
We consider the approximate nearest neighbour search problem (ANN) on the d-dimensional Hamming cube. We show that a randomised cell probe algorithm that uses polynomial storage and poly(d) word size requires a worst case query time of $\Omega(\log\log d / \log\log\log d)$. The approximation factor may be as loose as $2^{\log^{1-\eta} d}$ for any fixed $\eta > 0$. This fills a major gap in the study of ANN since all earlier lower bounds either did not allow randomisation or did not allow approximation.
Generalising the dimension reduction techniques in the ANN algorithm of Kushilevitz et al., we then give a cell probe upper bound which matches the lower bound up to a constant, for any constant approximation factor $\epsilon > 0$.
Our lower bound uses information theoretic techniques for communication complexity, a theme that has been prominent in recent research.
Joint work with Oded Regev.
|
Encryption of High-Entropy Sources: Fooling an Unbounded Adversary with a Short Key
Yevgeniy Dodis, NYU
Russell and Wang (RW02) recently introduced an elegant, information-theoretic notion called "entropic security of encryption": they require a condition similar to semantic security (GM84), but only when the distribution on messages has high min-entropy. They show that this notion of security can be achieved with very short keys for sufficiently entropic message spaces.
We show that entropic security is equivalent to *indistinguishability* of encryptions of messages from high min-entropy sources -- an equivalence inspired by the conventional notion of indistinguishability (GM84). This equivalence allows us to give a variety of constructions which use key length n - t - 2 log(1/eps) for n-bit message spaces of entropy at least t, with error parameter eps. The constructions can be viewed as special randomness extractors with an additional property that allows decryption.
The same equivalence also allows us to prove lower bounds on entropic security, either by direct argument or by reduction to lower bounds on extractors.
Joint work with Adam Smith (MIT).
|
Zero-Knowledge Sets
Silvio Micali, MIT
Despite the centrality of sets in mathematics, no general zero-knowledge treatment of sets has existed to date. We fill this gap by showing that a polynomial-time Prover can commit to an arbitrary set S and then, for any sequence of potential elements X, prove that X is in S or that X is not in S, whichever the case may be, without revealing anything more than is implied by these mere assertions.
Our implementation, based on the discrete logarithm problem, is non-interactive and extremely efficient.
Joint work with Michael Rabin and Joe Kilian.
|
Bounded-Concurrent Secure Two-Party Computation in a Constant Number of Rounds
Alon Rosen, MIT
We consider the problem of constructing a general protocol for
secure two-party computation in a way that preserves security
under concurrent composition. In our treatment, we focus on the
case where an a-priori bound on the number of concurrent sessions
is specified before the protocol is constructed (a.k.a.~bounded
concurrency). We make no set-up assumptions.
Lindell (STOC 2003) has shown that any protocol for
bounded-concurrent secure two-party computation, whose security is
established via black-box simulation, must have round complexity
that is {\em strictly larger} than the bound on the number of
concurrent sessions. In this paper, we construct a (non black-box)
protocol for realizing bounded-concurrent secure two-party
computation in a {\em constant} number of rounds. The only
previously known protocol for realizing the above task required
more rounds than the pre-specified bound on the number of sessions
(despite usage of non black-box simulation techniques).
Our constructions rely on the existence of enhanced trapdoor
permutations, as well as on the existence of collision-resistant
hash functions. (Both assumptions are implied by the hardness of,
e.g., factoring large integers.)
Joint work with Rafael Pass.
|
Navigating nets: Simple algorithms for proximity search
Robi Krauthgamer, IBM Almaden
Nearest-neighbor search is the following problem: Given a set $S$ of $n$ points in a metric space $X$, preprocess $S$ so as to efficiently find the point in $S$ that is closest to a query point $q\in X$. Recently, there has been significant interest in the case of general metrics, because in many practical applications the underlying metric is not Euclidean (or another $l_p$-metric).
We devise for this problem a simple and efficient dynamic data structure (i.e., one that supports insertions and deletions in $S$) that answers $(1+\epsilon)$-approximate nearest neighbor queries for any $\epsilon>0$. Our algorithms are deterministic and provably correct with no assumptions on the metric. Their (time and space) complexity depends on the dimensionality of the data set $S$, which we measure using a very natural notion borrowed from functional analysis. For example, if $S$ has constant dimension, the above operations take logarithmic time (for fixed $\epsilon>0$). Our data structure outperforms the metric skip list of Karger and Ruhl [STOC, 2002] in several aspects and is conceptually simpler.
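For intuition about the nets in the title: an r-net is a subset of the points that is both r-separated and r-covering, and navigating nets maintain such nets at geometrically decreasing scales. Here is a one-scale greedy sketch (the example and names are mine; this is not the actual data structure from the talk):

def greedy_net(points, r, dist):
    """Greedily extract an r-net: a subset whose members are pairwise at
    distance >= r, and such that every input point is within distance r of
    some member (skipped points were within r of an earlier member)."""
    net = []
    for p in points:
        if all(dist(p, q) >= r for q in net):
            net.append(p)
    return net

# Example on the real line with the usual metric.
print(greedy_net([0.0, 0.4, 1.1, 1.5, 3.2], 1.0, lambda a, b: abs(a - b)))
# -> [0.0, 1.1, 3.2]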
[Joint work with James R. Lee.]
|
Measured descent: New embeddings for finite metrics
James Lee, UC Berkeley
We give a new proof of Bourgain's theorem based on metric decomposition and a technique we call "measured descent." This provides a unified framework for the two primary methods of constructing Frechet-type embeddings, those based on the work of Bourgain and of Rao. Our technique yields some interesting refinements of Bourgain's theorem in terms of important parameters of the metric space -- for instance, we show that every n-point metric embeds into a Euclidean space with distortion $O(\sqrt{\dim(X) \log n})$, where dim(X) is the "metric dimension" of X (which is always at most log n).
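As a reminder of what a Frechet-type embedding looks like (each coordinate is the distance to some subset of the points), here is a toy, untuned sketch in the spirit of Bourgain's construction; the sampling densities, repetition count and names are my own illustration, not the construction from the talk.

import math, random

def frechet_embedding(points, dist, seed=0):
    """Toy Bourgain-style embedding: sample subsets at geometric densities
    and map each point to its vector of distances to those subsets."""
    rng = random.Random(seed)
    n = len(points)
    subsets = []
    for scale in range(1, int(math.log2(n)) + 1):
        for _ in range(int(math.log2(n)) + 1):        # a few repetitions per scale
            subsets.append([p for p in points if rng.random() < 2.0 ** (-scale)])
    def embed(x):
        return [min((dist(x, a) for a in A), default=0.0) for A in subsets]
    return embed

# Five points on a line, embedded into a space of O(log^2 n) coordinates.
pts = [0.0, 1.0, 2.0, 4.0, 8.0]
f = frechet_embedding(pts, lambda a, b: abs(a - b))
print(len(f(0.0)), f(4.0)[:4])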
We use this technique to answer a number of open questions. For instance, our embeddings are volume-respecting for subsets of arbitrary size; this answers positively an open question of Feige. Another application achieves the optimal dimension necessary to embed planar metrics into l_infty, answering a question of Rabinovich.
[This is joint work with R. Krauthgamer, M. Mendel, and A. Naor.]
|
On Conflict-Free Colorings
Shakhar Smorodinsky, MSRI
Motivated by frequency assignment problems in cellular networks, we study some coloring problems of the following flavor: What is the minimum number $f(n)$ such that one can assign colors to any set $P$ of $n$ points in the plane, using a total of at most $f(n)$ colors, so that the coloring has the following property (which we refer to as conflict-free coloring): for any disc $d$ in the plane with nonempty intersection with $P$, there is at least one point of $P$ inside $d$ whose color is unique among the points inside $d$.
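To make the definition concrete, here is a small checker against a finite list of query discs (only a spot check, since the actual definition quantifies over all discs in the plane; the points, colors and discs are my own toy example):

from collections import Counter

def is_conflict_free(points, colors, discs):
    """For every disc (cx, cy, r) that contains at least one point, some
    color must occur exactly once among the contained points."""
    for cx, cy, r in discs:
        inside = [c for (x, y), c in zip(points, colors)
                  if (x - cx) ** 2 + (y - cy) ** 2 <= r * r]
        if inside and 1 not in Counter(inside).values():
            return False
    return True

pts = [(0, 0), (2, 0), (1, 2)]
print(is_conflict_free(pts, [0, 1, 0], [(1, 1, 2.5)]))      # -> True
print(is_conflict_free(pts, [0, 1, 0], [(0.5, 1.0, 1.2)]))  # -> False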
We show that $f(n) = O(\log n)$. This is asymptotically tight in the worst case.
We extend this result to many other classes of ranges (other than disks), and also to the dual type of problems, where we want to color a given set of ranges so that for each point $p$ there is a range with a unique color among the ranges that contain $p$.
Some parts of this talk are joint work with Guy Even, Zvika Lotker and Dana Ron (Tel-Aviv University), and other parts are joint work with Sariel Har-Peled (UIUC).
|
A Brief History of Generative Models for Power Law and Lognormal Distributions, and An Application to File Size Distributions
Michael Mitzenmacher, Harvard
We examine the long history of power law and lognormal distributions, showing that although these distributions are relatively new to the computer science community, they have a long and interesting history in other communities. We also describe how a new generative model from economics leads us to a new model for file sizes, explaining why they might have a power law distribution.
|