Metric Clustering
Claire Kenyon, Ecole Polytechnique and Institut Universitaire de France
Given n data items in a general metric space, how can we efficiently partition them into k clusters of "similar" items? There are many models for this ubiquitous problem, which arises in application areas such as data mining, image recognition, medical imaging, web analysis, etc. One measure of the quality of a k-clustering is the sum of all pairwise distances inside the clusters, which must be minimized. We discuss techniques and algorithms, first for the complementary problem, which can be seen as a metric analog of Max-Cut in graphs, then for 2-clustering, and finally sketch extensions to variants with other objective functions or with cardinality constraints. The algorithms are based on random sampling.
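As a small illustration of the objective, here is a minimal sketch (the point set, metric and function name are mine, not from the talk) that computes the sum of intra-cluster pairwise distances for a given partition:

from itertools import combinations

def within_cluster_cost(points, clusters, dist):
    """Sum of all pairwise distances inside the clusters -- the k-clustering
    objective to be minimised.  `clusters` is a list of lists of indices
    into `points`; `dist` is any metric on the points."""
    return sum(dist(points[i], points[j])
               for cluster in clusters
               for i, j in combinations(cluster, 2))

# Toy example: five points on a line with the usual metric.
pts = [0.0, 1.0, 2.0, 10.0, 11.0]
d = lambda a, b: abs(a - b)
print(within_cluster_cost(pts, [[0, 1, 2], [3, 4]], d))  # -> 5.0 (= 4.0 + 1.0)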
|
Perfect Phylogeny and Haplotype Assignment
Eran Halperin, Princeton University
Each person's genome contains two copies of each chromosome, one inherited from the father and the other from the mother. A person's genotype specifies the pair of bases at each site, but does not specify which base occurs on which chromosome. The sequence of each chromosome separately is called a haplotype. The determination of the haplotypes within a population is essential for understanding genetic variation and the inheritance of complex diseases. The haplotype mapping project, a successor to the human genome project, seeks to determine the common haplotypes in the human population.
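To make the genotype/haplotype ambiguity concrete, here is a toy sketch (the 0/1/2 site encoding and the function name are my own illustration): a genotype with k heterozygous sites is consistent with 2^(k-1) unordered haplotype pairs, which is what makes reconstruction non-trivial.

from itertools import product

def haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.
    Encoding (assumed here): per site, 0 = homozygous 0, 1 = homozygous 1,
    2 = heterozygous.  Each heterozygous site can be phased two ways."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for choice in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, choice):
            h1[site], h2[site] = bit, 1 - bit
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs

# A genotype with 3 heterozygous sites has 2^(3-1) = 4 consistent pairs.
print(len(haplotype_pairs([0, 2, 1, 2, 2])))  # -> 4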
In this talk I will focus on the theoretical aspects of the reconstruction of haplotypes from genotypes or from 'noisy' haplotypes. The problems arising in this area are of great practical importance, but at the same time, they motivate interesting theoretical problems. I will describe relations of these problems to phylogeny problems, to the set cover problem, and to others.
The talk will be self-contained - no biological background is needed.
Joint work with Eleazar Eskin and Richard M. Karp.
|
Clustering with Qualitative Information
Tony Wirth, Princeton University
We consider the problem of clustering a collection of elements based on
pairwise judgments of similarity and dissimilarity. Bansal, Blum and Chawla cast
the problem thus: given a graph G whose edges are labeled "+"
(similar) or "-" (dissimilar), partition the vertices into clusters so
that the number of pairs correctly (resp. incorrectly) classified with respect to
the input labeling is maximized (resp. minimized). Complete graphs, where the
classifier labels every edge, and general graphs, where some edges are not
labeled, are both worth studying. We answer several questions left open in the
paper of Bansal et al.
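For concreteness, here is a small sketch of the quantity being minimized (the dictionary representation and the toy instance are mine): count the labeled pairs that a candidate clustering classifies incorrectly.

def disagreements(labels, cluster_of):
    """Number of labeled pairs classified incorrectly by a clustering.
    `labels` maps a pair (u, v) to '+' (similar) or '-' (dissimilar);
    unlabeled pairs simply do not appear.  `cluster_of` maps each vertex
    to its cluster id.  A '+' pair is wrong if its endpoints are separated,
    a '-' pair is wrong if they are placed together."""
    bad = 0
    for (u, v), sign in labels.items():
        same = cluster_of[u] == cluster_of[v]
        if (sign == '+' and not same) or (sign == '-' and same):
            bad += 1
    return bad

# Toy instance: two '+' edges and one '-' edge force at least one mistake.
labels = {('a', 'b'): '+', ('b', 'c'): '+', ('a', 'c'): '-'}
print(disagreements(labels, {'a': 0, 'b': 0, 'c': 0}))  # -> 1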
In this presentation I will show a factor 4 approximation algorithm for
minimization on complete graphs, and a factor O(log n) approximation for general
graphs. For the maximization version, a PTAS (polynomial time approximation
scheme) for complete graphs is shown by Bansal et al.; I will demonstrate a
factor 0.7664 approximation for general graphs, using semi-definite programming.
If time permits, I will summarize the reductions used in showing that some of
these problems are APX-hard.
This is joint work with Moses Charikar and Venkatesan Guruswami and will
appear in FOCS 2003.
|
Frugality in Path Auctions
Edith Elkind, Princeton
We consider the problem of picking (buying) an inexpensive s-t
path in a graph where edges are owned by independent (selfish)
agents, and the cost of an edge is known to its owner only.
We focus on minimizing the buyer's total payments.
First, we show that any mechanism with (weakly) dominant strategies for the agents (or, equivalently, any truthful mechanism) can force the buyer to make very large payments. Namely, for every such mechanism, the buyer can be forced to pay c(P) + n/2*(c(Q)-c(P)), where c(P) is the cost of the shortest path, c(Q) is the cost of the second-shortest path, and n is the number of edges in P. This extends the previous work of Archer and Tardos, who showed a similar lower bound for a subclass of truthful mechanisms called min-function mechanisms. Our lower bounds have no such limitations on the mechanism.
Motivated by this lower bound, we study a wider class of mechanisms for this problem, namely ones that admit Nash equilibrium strategies. In this class, we identify the optimal mechanism with regard to total payment. We then demonstrate a separation in terms of average overpayments between the classical VCG mechanism and the optimal mechanism, showing that under various natural distributions of edge costs, the optimal mechanism pays at most a logarithmic factor more than the actual cost, whereas VCG pays $\sqrt{n}$ times the actual cost. On the other hand, we also show that the optimal mechanism does incur at least a constant factor overpayment in natural distributions of edge costs. Since our mechanism is optimal, this gives a lower bound on all mechanisms with Nash equilibria.
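As a point of reference for the overpayment discussion, here is a minimal sketch of the standard VCG payment rule for shortest-path (edge) auctions: each winning edge is paid its cost plus the increase in the s-t distance caused by its removal. The graph, costs and helper names below are illustrative, not taken from the talk.

import heapq

def shortest_path(edges, s, t):
    """Dijkstra over an undirected graph given as a dict {(u, v): cost}.
    Returns (distance, list of edge keys on a shortest s-t path)."""
    adj = {}
    for (u, v), c in edges.items():
        adj.setdefault(u, []).append((v, c, (u, v)))
        adj.setdefault(v, []).append((u, c, (u, v)))
    dist, prev, heap = {s: 0.0}, {}, [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue
        for v, c, e in adj.get(u, []):
            if d + c < dist.get(v, float('inf')):
                dist[v], prev[v] = d + c, (u, e)
                heapq.heappush(heap, (d + c, v))
    if t not in dist:
        return float('inf'), None
    path, node = [], t
    while node != s:
        node, e = prev[node]
        path.append(e)
    return dist[t], path

def vcg_payments(edges, s, t):
    """VCG pays each edge on the winning (cheapest) path its own cost plus
    the increase in the s-t distance when that edge is removed."""
    base, path = shortest_path(edges, s, t)
    payments = {}
    for e in path:
        alt, _ = shortest_path({f: c for f, c in edges.items() if f != e}, s, t)
        payments[e] = edges[e] + (alt - base)
    return payments

# Two parallel routes of cost 1+1 and 3: each winning edge is paid 2, so VCG
# pays 4 in total for a path whose true cost is 2.
edges = {('s', 'm'): 1, ('m', 't'): 1, ('s', 't'): 3}
print(vcg_payments(edges, 's', 't'))  # -> {('m', 't'): 2, ('s', 'm'): 2}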
This is joint work with Amit Sahai and Ken Steiglitz.
|
The Ising model on trees: Boundary conditions and mixing time
Dror Weitz, UC Berkeley
We give the first comprehensive analysis of the effect of boundary conditions on the mixing time of the Glauber dynamics for the Ising model. Specifically, we show that the mixing time on an $n$-vertex regular tree with plus-boundary remains $O(n\log n)$ at all temperatures (in contrast to the free boundary case, where the mixing time is not bounded by any fixed polynomial at low temperatures). We also show that this bound continues to hold in the presence of an arbitrary external field. Our results are actually stronger, and provide tight bounds on the log-Sobolev constant and the spectral gap of the dynamics. In addition, our methods yield simpler proofs and stronger results for the mixing time in the regime where it is insensitive to the boundary conditions. Our techniques also apply to a much wider class of models, including those with hard-core constraints like the antiferromagnetic Potts model at zero temperature (proper colorings) and the hard-core lattice gas (independent sets).
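For readers unfamiliar with the dynamics, the following is a minimal simulation sketch of heat-bath Glauber dynamics for the Ising model on a complete binary tree with a plus boundary (leaves pinned to +1); the parameters are arbitrary and this is only an illustration, not the analysis from the talk.

import math, random

def glauber_ising_tree(depth=6, beta=0.8, field=0.0, steps=20000, seed=0):
    """Heat-bath Glauber dynamics for the Ising model on a complete binary
    tree with the leaves fixed to +1 (a 'plus' boundary condition)."""
    rng = random.Random(seed)
    n = 2 ** (depth + 1) - 1            # nodes 0..n-1; children of i: 2i+1, 2i+2
    first_leaf = 2 ** depth - 1
    spin = [rng.choice((-1, 1)) for _ in range(n)]
    for v in range(first_leaf, n):      # plus boundary: leaves pinned to +1
        spin[v] = 1
    def neighbours(v):
        nbrs = [(v - 1) // 2] if v > 0 else []
        return nbrs + [c for c in (2 * v + 1, 2 * v + 2) if c < n]
    for _ in range(steps):
        v = rng.randrange(first_leaf)   # only interior vertices are updated
        local = sum(spin[u] for u in neighbours(v)) + field
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * local))
        spin[v] = 1 if rng.random() < p_plus else -1
    return sum(spin) / n                # magnetisation; biased towards +1

print(glauber_ising_tree())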
In the talk, I will start by reviewing the Ising model on trees, the notion of boundary conditions, and the Glauber dynamics. I will then describe the main ideas allowing us to perform an analysis that is specific to a boundary condition, and present a simple and very useful criterion for $O(n\log n)$ mixing time on a regular tree. I will then illustrate the usefulness of this criterion in the context of the Ising model and a plus-boundary condition. Time allowing, I will use the same criterion in the context of other models, specifically the independent-sets model.
Joint work with Fabio Martinelli and Alistair Sinclair.
|
Approximation Algorithms for Asymmetric TSP by Decomposing Directed Regular Multigraphs
Maxim Sviridenko, IBM
A directed multigraph is said to be d-regular if the indegree and outdegree of every vertex are exactly d. By Hall's theorem, one can represent such a multigraph as a combination of at most n^2 cycle covers, each taken with an appropriate multiplicity. We prove that if the d-regular multigraph does not contain more than d/2 copies of any 2-cycle, then we can find a similar decomposition into O(n^2) pairs of cycle covers where each 2-cycle occurs in at most one component of each pair.
Our proof is constructive and gives a polynomial algorithm to find such a decomposition. Since our applications only need one such pair of cycle covers whose weight is at least the average weight of all pairs, we also give a simpler algorithm to extract a single such pair.
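The single-cycle-cover primitive that underlies such arguments reduces to an assignment problem; below is a rough sketch of that reduction (the toy weights and names are mine, and this is only the primitive, not the decomposition algorithm from the talk).

import numpy as np
from scipy.optimize import linear_sum_assignment

def max_weight_cycle_cover(w):
    """Maximum-weight cycle cover of a complete directed graph.
    A cycle cover assigns each vertex a distinct successor, i.e. it is a
    permutation with no fixed points, so it reduces to an assignment problem."""
    cost = -np.asarray(w, dtype=float)
    np.fill_diagonal(cost, 1e9)               # large penalty forbids self-loops
    rows, cols = linear_sum_assignment(cost)
    succ = dict(zip(rows.tolist(), cols.tolist()))
    return succ, sum(w[i][succ[i]] for i in range(len(w)))

# Toy 4-vertex instance; the optimum consists of the two 2-cycles {0,1}, {2,3}.
w = [[0, 5, 1, 1],
     [5, 0, 1, 1],
     [1, 1, 0, 5],
     [1, 1, 5, 0]]
print(max_weight_cycle_cover(w))  # -> ({0: 1, 1: 0, 2: 3, 3: 2}, 20)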
This combinatorial theorem then comes in handy in rounding a fractional solution of an LP relaxation of the maximum and minimum TSP problems. For maximum TSP, we obtain a tour whose weight is at least 2/3 of the weight of the longest tour, improving a previous 5/8 approximation. For minimum TSP, we obtain a tour whose weight is at most 0.842 log_2 n times the optimal, improving a previous 0.999 log_2 n approximation. Utilizing a reduction from maximum TSP to the shortest superstring problem, we obtain a 2.5-approximation algorithm for the latter problem which is again much simpler than the previous one. Other applications of the rounding procedure are approximation algorithms for maximum 3-cycle cover (factor 2/3, previously 3/5) and maximum asymmetric TSP with triangle inequality (factor 10/13, previously 3/4).
|
Tight Bounds for Approximate Nearest Neighbour Search
Amit Chakrabarti, Dartmouth
We consider the approximate nearest neighbour search problem (ANN) on the d-dimensional Hamming cube. We show that a randomised cell probe algorithm that uses polynomial storage and poly(d) word size requires a worst case query time of $\Omega(\log\log d / \log\log\log d)$. The approximation factor may be as loose as $2^{\log^{1-\eta} d}$ for any fixed $\eta > 0$. This fills a major gap in the study of ANN since all earlier lower bounds either did not allow randomisation or did not allow approximation.
Generalising the dimension reduction techniques in the ANN algorithm of Kushilevitz et al., we then give a cell probe upper bound which matches the lower bound up to a constant, for any constant approximation factor $\epsilon > 0$.
Our lower bound uses information theoretic techniques for communication complexity, a theme that has been prominent in recent research.
Joint work with Oded Regev.
|
Encryption of High-Entropy Sources: Fooling an Unbounded Adversary with a Short Key
Yevgeniy Dodis, NYU
Russell and Wang (RW02) recently introduced an elegant, information-theoretic notion called "entropic security of encryption": they require a condition similar to semantic security (GM84), but only when the distribution on messages has high min-entropy. They show that this notion of security can be achieved with very short keys for sufficiently entropic message spaces.
We show that entropic security is equivalent to *indistinguishability* of encryptions of messages from high min-entropy sources -- an equivalence inspired by the conventional notion of indistinguishability (GM84). This equivalence allows us to give a variety of constructions which use key length n - t - 2 log(1/eps) for n-bit message spaces of entropy at least t, with error parameter eps. The constructions can be viewed as special randomness extractors with an additional property that allows decryption.
The same equivalence also allows us to prove lower bounds on entropic security, either by direct argument or by reduction to lower bounds on extractors.
Joint work with Adam Smith (MIT).
|
Zero-Knowledge Sets
Silvio Micali, MIT
Despite the centrality of sets in mathematics, no general zero-knowledge treatment of sets has existed to date. We fill this gap by showing that a polynomial-time Prover can commit to an arbitrary set S and then, for any sequence of potential elements X, prove that X is in S or that X is not in S, whichever the case may be, without revealing anything more than is implied by these mere assertions.
Our implementation, based on the discrete logarithm problem, is non-interactive and extremely efficient.
Joint work with Michael Rabin and Joe Kilian.
|
Bounded-Concurrent Secure Two-Party Computation in a Constant Number of Rounds
Alon Rosen, MIT
We consider the problem of constructing a general protocol for
secure two-party computation in a way that preserves security
under concurrent composition. In our treatment, we focus on the
case where an a-priori bound on the number of concurrent sessions
is specified before the protocol is constructed (a.k.a.~bounded
concurrency). We make no set-up assumptions.
Lindell (STOC 2003) has shown that any protocol for
bounded-concurrent secure two-party computation, whose security is
established via black-box simulation, must have round complexity
that is {\em strictly larger} than the bound on the number of
concurrent sessions. In this paper, we construct a (non black-box)
protocol for realizing bounded-concurrent secure two-party
computation in a {\em constant} number of rounds. The only
previously known protocol for realizing the above task required
more rounds than the pre-specified bound on the number of sessions
(despite usage of non black-box simulation techniques).
Our constructions rely on the existence of enhanced trapdoor
permutations, as well as on the existence of collision-resistant
hash functions. (Both assumptions are implied by the hardness of,
e.g., factoring large integers.)
Joint work with Rafael Pass.
|
Navigating nets: Simple algorithms for proximity search
Robi Krauthgamer, IBM Almaden
Nearest-neighbor search is the following problem: Given a set $S$ of $n$ points in a metric space $X$, preprocess $S$ so as to efficiently find the point in $S$ that is closest to a query point $q\in X$. Recently, there has been significant interest in the case of general metrics, because in many practical applications the underlying metric is not Euclidean (or another $l_p$-metric).
We devise for this problem a simple and efficient dynamic data structure (i.e., one that supports insertions and deletions in $S$) that answers $(1+\epsilon)$-approximate nearest neighbor queries for any $\epsilon>0$. Our algorithms are deterministic and provably correct with no assumptions on the metric. Their (time and space) complexity depends on the dimensionality of the data set $S$, which we measure using a very natural notion borrowed from functional analysis. For example, if $S$ has constant dimension, the above operations take logarithmic time (for fixed $\epsilon>0$). Our data structure outperforms the metric skip list of Karger and Ruhl [STOC, 2002] in several aspects and is conceptually simpler.
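For intuition about the nets in the title: an r-net is a subset of the points that is both r-separated and r-covering, and navigating nets maintain such nets at geometrically decreasing scales. Here is a one-scale greedy sketch (the example and names are mine; this is not the actual data structure from the talk):

def greedy_net(points, r, dist):
    """Greedily extract an r-net: a subset whose members are pairwise at
    distance >= r, and such that every input point is within distance r of
    some member (skipped points were within r of an earlier member)."""
    net = []
    for p in points:
        if all(dist(p, q) >= r for q in net):
            net.append(p)
    return net

# Example on the real line with the usual metric.
print(greedy_net([0.0, 0.4, 1.1, 1.5, 3.2], 1.0, lambda a, b: abs(a - b)))
# -> [0.0, 1.1, 3.2]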
[Joint work with James R. Lee.]
|
Measured descent: New embeddings for finite metrics
James Lee, UC Berkeley
We give a new proof of Bourgain's theorem based on metric decomposition and a technique we call "measured descent." This provides a unified framework for the two primary methods of constructing Frechet-type embeddings, those based on the work of Bourgain and of Rao. Our technique yields some interesting refinements of Bourgain's theorem in terms of important parameters of the metric space -- for instance, we show that every n-point metric embeds into a Euclidean space with distortion $O(\sqrt{\dim(X) \log n})$, where dim(X) is the "metric dimension" of X (which is always at most log n).
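As a reminder of what a Frechet-type embedding looks like (each coordinate is the distance to some subset of the points), here is a toy, untuned sketch in the spirit of Bourgain's construction; the sampling densities, repetition count and names are my own illustration, not the construction from the talk.

import math, random

def frechet_embedding(points, dist, seed=0):
    """Toy Bourgain-style embedding: sample subsets at geometric densities
    and map each point to its vector of distances to those subsets."""
    rng = random.Random(seed)
    n = len(points)
    subsets = []
    for scale in range(1, int(math.log2(n)) + 1):
        for _ in range(int(math.log2(n)) + 1):        # a few repetitions per scale
            subsets.append([p for p in points if rng.random() < 2.0 ** (-scale)])
    def embed(x):
        return [min((dist(x, a) for a in A), default=0.0) for A in subsets]
    return embed

# Five points on a line, embedded into a space of O(log^2 n) coordinates.
pts = [0.0, 1.0, 2.0, 4.0, 8.0]
f = frechet_embedding(pts, lambda a, b: abs(a - b))
print(len(f(0.0)), f(4.0)[:4])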
We use this technique to answer a number of open questions. For instance, our embeddings are volume-respecting for subsets of arbitrary size; this answers positively an open question of Feige. Another application achieves the optimal dimension necessary to embed planar metrics into l_infty, answering a question of Rabinovich.
[This is joint work with R. Krauthgamer, M. Mendel, and A. Naor.]
|
On Conflict-Free Colorings
Shakhar Smorodinsky, MSRI
Motivated by frequency assignment problems in cellular networks, we study some coloring problems of the following flavor: What is the minimum number $f(n)$ such that one can assign colors to any set $P$ of $n$ points in the plane, using a total of at most $f(n)$ colors, so that the coloring has the following property (which we refer to as conflict-free coloring): for any disc $d$ in the plane with nonempty intersection with $P$, there is at least one point of $P$ inside $d$ whose color is unique among the points inside $d$.
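To make the definition concrete, here is a small checker against a finite list of query discs (only a spot check, since the actual definition quantifies over all discs in the plane; the points, colors and discs are my own toy example):

from collections import Counter

def is_conflict_free(points, colors, discs):
    """For every disc (cx, cy, r) that contains at least one point, some
    color must occur exactly once among the contained points."""
    for cx, cy, r in discs:
        inside = [c for (x, y), c in zip(points, colors)
                  if (x - cx) ** 2 + (y - cy) ** 2 <= r * r]
        if inside and 1 not in Counter(inside).values():
            return False
    return True

pts = [(0, 0), (2, 0), (1, 2)]
print(is_conflict_free(pts, [0, 1, 0], [(1, 1, 2.5)]))      # -> True
print(is_conflict_free(pts, [0, 1, 0], [(0.5, 1.0, 1.2)]))  # -> False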
We show that $f(n) = O(\log n)$. This is asymptotically tight in the worst case.
We extend this result to many other classes of ranges (other than disks), and also to the dual type of problems, where we want to color a given set of ranges so that for each point $p$ there is a range with a unique color among the ranges that contain $p$.
Some parts of this talk are joint work with Guy Even, Zvika Lotker and Dana Ron (Tel-Aviv University), and other parts are joint work with Sariel Har-Peled (UIUC).
|
A Brief History of Generative Models for Power Law and Lognormal Distributions, and An Application to File Size Distributions
Michael Mitzenmacher, Harvard
We examine the long history of power law and lognormal distributions, showing that although these distributions are relatively new to the computer science community, they have a long and interesting history in other communities. We also describe how a new generative model from economics leads us to a new model for file sizes, explaining why they might have a power law distribution.
|