Binary Embedding: Fundamental Limits and Fast Algorithm




Xinyang Yi, Eric Price, Constantine Caramanis
The University of Texas at Austin
yixy@utexas.edu, ecprice@cs.utexas.edu, constantine@utexas.edu

Abstract

Binary embedding is a nonlinear dimension reduction methodology where high dimensional data are embedded into the Hamming cube while preserving the structure of the original space. Specifically, for arbitrary N distinct points in S^{p-1}, our goal is to encode each point using m-dimensional binary strings such that we can reconstruct their geodesic distance up to δ uniform distortion. Existing binary embedding algorithms either lack theoretical guarantees or suffer from running time O(mp). We make three contributions: (1) we establish a lower bound showing that any binary embedding oblivious to the set of points requires m = Ω((1/δ²) log N) bits, and a similar lower bound for non-oblivious embeddings into Hamming distance; (2) we propose a novel fast binary embedding algorithm with provably optimal bit complexity m = O((1/δ²) log N) and near linear running time O(p log p) whenever log N ≲ δ√p, with a slightly worse running time for larger log N; (3) we also provide an analytic result about embedding a general set of points K ⊆ S^{p-1} of possibly infinite size. Our theoretical findings are supported through experiments on both synthetic and real data sets.

1 Introduction

Low distortion embeddings that transform high-dimensional points to a low-dimensional space have played an important role in dealing with storage, information retrieval and machine learning problems for modern datasets. Perhaps one of the most famous results along these lines is the Johnson-Lindenstrauss (JL) lemma (Johnson and Lindenstrauss, 1984), which shows that N points can be embedded into an O((1/δ²) log N)-dimensional space while preserving pairwise Euclidean distances up to δ relative distortion. This 1/δ² dependence has been shown to be information-theoretically optimal (Alon, 2003).
Significant work has focused on fast algorithms for computing the embeddings, e.g., (Ailon and Chazelle, 2006; Krahmer and Ward, 2011; Ailon and Liberty, 2013; Cheraghchi et al., 2013; Nelson et al., 2014).

More recently, there has been growing interest in designing binary codes for high dimensional points with low distortion, i.e., embeddings into the binary cube (Weiss et al., 2009; Raginsky and Lazebnik, 2009; Salakhutdinov and Hinton, 2009; Liu et al., 2011; Gong and Lazebnik, 2011; Yu et al., 2014). Compared to JL embedding, embedding into the binary cube (also called binary embedding) has two advantages in practice: (i) as each data point is represented by a binary code, the disk space for storing the entire dataset is reduced considerably; (ii) distance in the binary cube is some function of the Hamming distance, which can be computed quickly using computationally efficient bit-wise operators. As a consequence, binary embedding can be applied to a large number of domains such as biology, finance and computer vision where the data are usually high dimensional. While most JL embeddings are linear maps, any binary embedding is fundamentally a nonlinear transformation. As we detail below, this nonlinearity poses significant new technical challenges for both upper and lower bounds. In particular, our understanding of the landscape is significantly less complete. To the best of our knowledge, lower bounds are not known; embedding algorithms for infinite sets have distortion-dependence on δ significantly exceeding their finite-set counterparts; and perhaps most significantly, there are no fast (near linear-time) embedding algorithms with strong performance guarantees. As we explain below, this paper contributes to each of these three areas. First, we detail some recent work and state of the art results.

Recent Work. A common approach pursued by several existing works considers the natural extension of JL embedding techniques via one-bit quantization of the projections:

    b(x) = sign(Ax),    (1.1)

where x ∈ R^p is the input data point, A ∈ R^{m×p} is a projection matrix and b(x) is the embedded binary code. In particular, Jacques et al.
(2011) show that when each entry of A is generated independently from N(0, 1), with m ≳ (1/δ²) log N this scheme with high probability achieves at most δ (additive) distortion for N points. Work in Plan and Vershynin (2014) extends these results to arbitrary sets K ⊆ S^{p-1} where K can be infinite. They prove that an embedding with δ-distortion can be obtained when m ≳ w(K)²/δ⁶, where w(K) is the Gaussian mean width of K. It is unknown whether the unusual δ⁶ dependence is optimal or not. Despite provable sample complexity guarantees, one-bit quantization of random projections as in (1.1) suffers from O(mp) running time for a single point. This quadratic dependence can result in a prohibitive computational cost for high-dimensional data. Analogously to the developments in fast JL embeddings, several algorithms have been proposed to overcome this computational issue. Work in Gong et al. (2013) proposes a bilinear projection method. By setting m = O(p), their method reduces the running time from O(p²) to O(p^{1.5}). More recently, work in Yu et al. (2014) introduces a circulant random projection algorithm that requires running time O(p log p). While these algorithms have reduced running time, as of yet they come without performance guarantees: to the best of our knowledge, the measurement complexities of the two algorithms are still unknown. Another line of work considers learning binary codes from data by solving certain optimization problems (Weiss et al., 2009; Salakhutdinov and Hinton, 2009; Norouzi et al., 2012; Yu et al., 2014). Unfortunately, there is no known provable bit complexity result for these algorithms. It is also worth noting that Raginsky and Lazebnik (2009) provide a binary code design for preserving shift-invariant kernels. Their method suffers from the same quadratic computational issue as the fully random Gaussian projection method.
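The storage and bit-wise-operator advantages mentioned above are easy to see in code. The following sketch (our illustration; the function names and parameter choices are ours, not the paper's) implements the one-bit quantization (1.1) with bit-packed codes, so that Hamming distance reduces to XOR plus popcount:

```python
import numpy as np

# Sketch of b(x) = sign(Ax) with packed binary codes. Illustrative only;
# names and the choice m = 4096 are ours.

def embed(X, A):
    """Map each row of X to a bit-packed binary code via sign(Ax)."""
    bits = (X @ A.T) >= 0             # N x m boolean matrix of sign bits
    return np.packbits(bits, axis=1)  # 8 bits per byte: m/8 bytes per point

def hamming(code_a, code_b, m):
    """Normalized Hamming distance between two packed codes (XOR + popcount)."""
    xor = np.bitwise_xor(code_a, code_b)
    return np.unpackbits(xor).sum() / m

rng = np.random.default_rng(0)
p, m = 128, 4096
A = rng.standard_normal((m, p))
x = rng.standard_normal(p); y = rng.standard_normal(p)
X = np.stack([x / np.linalg.norm(x), y / np.linalg.norm(y)])

codes = embed(X, A)                   # 2 x 512 bytes instead of 2 x 128 floats
d_geo = np.arccos(np.clip(X[0] @ X[1], -1.0, 1.0)) / np.pi
d_ham = hamming(codes[0], codes[1], m)
# With m = 4096, d_ham agrees with the geodesic distance to within a few percent.
```

Note the storage reduction: each point occupies m/8 = 512 bytes rather than p = 128 doubles.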

Another related dimension reduction technique is locality sensitive hashing (LSH), where the goal is to compute a discrete data structure such that similar points are mapped into the same bucket with high probability (see, e.g., Andoni and Indyk (2006)). The key difference is that LSH preserves short distances, but binary embedding preserves both short and far distances. For points that are far apart, LSH only cares that the hashes are different, while binary embedding cares how different they are.

Contributions of this paper. In this paper, we address several unanswered problems about binary embedding. We provide lower bounds for both data-oblivious and data-aware embeddings; we provide a fast algorithm for binary embedding; and finally we consider the setting of infinite sets, and prove that in some of the most common cases we can improve the state-of-the-art sample complexity guarantees by a factor of δ²:

1. We provide two lower bounds for binary embeddings. The first shows that any method for embedding, and for recovering a distance estimate from the embedded points, that is independent of the data being embedded must use m = Ω((1/δ²) log N) bits. This is based on a bound on the communication complexity of Hamming distance used by Jayram and Woodruff (2013) for a lower bound on the distributional JL embedding. Separately, we give a lower bound for arbitrarily data-dependent methods that embed into (any function of) the Hamming distance, showing such algorithms require m = Ω(log N / (δ² log(1/δ))). This bound is similar to Alon (2003), which gets the same result for JL, but the binary embedding requires a different construction.

2. We provide the first provably fast algorithm with optimal measurement complexity O((1/δ²) log N). The proposed algorithm has running time O((1/δ²) log(1/δ) log² N log p log³(log N) + p log p) and thus has almost linear time complexity when log N ≲ δ√p. Our algorithm is based on two key novel ideas.
First, our similarity metric is based on the median Hamming distance of sub-blocks of the binary code; second, our new embedding takes advantage of a pair-wise independence argument for Gaussian Toeplitz projections that could be of independent interest.

3. For an arbitrary set K ⊆ S^{p-1} and the fully random Gaussian projection algorithm, we prove that m = O(w(K_+)²/δ⁴) is sufficient to achieve δ uniform distortion. Here K_+ is an expanded set of K. Although in general K ⊆ K_+ and hence w(K) ≤ w(K_+), for interesting K such as sparse or low rank sets, one can show w(K_+) = Θ(w(K)). Therefore applying our theory to these sets results in an improved dependence on δ compared to a recent result in Plan and Vershynin (2014). See Section 3.3 for a detailed discussion.

Discussion. For fast binary embedding, one simple solution, to the best of our knowledge not previously stated, is to combine a Gaussian projection with the well known results about fast JL. In detail, consider the strategy b(x) = sign(AFx), where A is a Gaussian matrix and F is any fast JL construction such as a subsampled Walsh-Hadamard matrix (Rudelson and Vershynin, 2008) or a partial circulant matrix with column flips (Krahmer et al., 2014). A simple analysis shows that this approach achieves measurement complexity O((1/δ²) log N) and running time O((1/δ⁴) log² N log p log³(log N) + p log p) by following the best known fast JL results. Our fast binary embedding algorithm builds on this simple but effective idea. Instead of using a Gaussian matrix after the fast JL transform, we use a series of Gaussian Toeplitz matrices that have fast matrix-vector multiplication. This novel construction improves the running time by a factor of 1/δ² while keeping the measurement complexity the same. In order for this to work, we need to change the estimator from the straight Hamming distance to one based on the median of several Hamming distances. An interesting point of comparison is Ailon and Rauhut (2014), which considers RIP-optimal distributions that give JL embeddings with optimal measurement complexity O((1/δ²) log N) and running time O(p log p). They show the existence of such embeddings whenever log N < δ² p^{1/2 - γ} for any constant γ > 0, which is essentially no better than the bound given by the folklore method of composing a Gaussian projection with a subsampled Fourier matrix. In our binary setting, we show how to improve the region of optimality by a factor of δ. It would be interesting to try and translate this result back to the JL setting.

Notation. We use [n] to denote the natural number set {1, 2, ..., n}. For natural numbers a < b, let [a, b] denote the consecutive set {a, a+1, ..., b}. A vector in R^n is denoted as x or equivalently (x_1, x_2, ..., x_n). We use x_I to denote the sub-vector of x with index set I ⊆ [n]. We denote entry-wise vector multiplication as x ⊙ y = (x_1 y_1, x_2 y_2, ..., x_n y_n). A matrix is typically denoted as M. Term (i, j) of M is denoted as M_{i,j}. Row i of M is denoted as M_i. The n-by-n identity matrix is denoted as I_n. For two random variables X, Y, we denote the statement that X and Y are independent as X ⊥ Y. For two binary strings a, b ∈ {0,1}^m, we use d_H(a, b) to denote the normalized Hamming distance, i.e., d_H(a, b) := (1/m) Σ_{i=1}^m 1(a_i ≠ b_i).

2 Organization, Problem Setup and Preliminaries

In this section, we state our problem formally, give some key definitions and present a simple (known) algorithm that sets the stage for the main results of this paper. The algorithm (Algorithm 1), discussed in detail below, is simply the one-bit quantization of a standard JL embedding. Its performance on finite sets is easy to analyze, and we state it in Proposition 2.2 below.
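As a small concreteness check, the normalized Hamming distance d_H from the Notation paragraph can be written out directly (an illustrative helper, not code from the paper):

```python
import numpy as np

# d_H(a, b) = (1/m) * sum_i 1(a_i != b_i) for two 0/1 strings of equal length m.
# Illustrative helper; the name d_H mirrors the paper's notation.

def d_H(a, b):
    a, b = np.asarray(a), np.asarray(b)
    assert a.shape == b.shape, "both strings must have the same length m"
    return np.mean(a != b)

# Example: 0110 and 0101 differ in their last two positions, so d_H = 2/4.
print(d_H([0, 1, 1, 0], [0, 1, 0, 1]))  # -> 0.5
```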
Three important questions remain unanswered: (i) Lower Bounds: is the performance guaranteed by Proposition 2.2 optimal? We answer this affirmatively in Section 3.1. (ii) Fast Embedding: whereas Algorithm 1 is quadratic (depending on the product mp), fast JL algorithms are nearly linear in p; does something similar exist for binary embedding? We develop a new algorithm in Section 3.2 that addresses the complexity issue, while at the same time guaranteeing a δ-embedding with dimension scaling that matches our lower bound. Interestingly, a key aspect of our contribution is that we use a slightly modified similarity function, namely the median of the normalized Hamming distance on sub-blocks. (iii) Infinite Sets: recent work analyzing the setting of infinite sets K ⊆ S^{p-1} shows a dependence of δ⁻⁶ on the distortion. Is this optimal? We show in Section 3.3 that in many settings this can be improved by a factor of δ². In Section 4, we provide numerical results. We give most proofs in Section 5.

2.1 Problem Setup

Given a set of p-dimensional points, our goal is to find a transformation f : R^p → {0,1}^m such that the Hamming distance (or another related, easily computable metric) between two binary codes is close to their similarity in the original space. We consider points on the unit sphere S^{p-1} and use

the normalized geodesic distance (occasionally, and somewhat misleadingly, called cosine similarity) as the input space similarity metric. For two points x, y ∈ R^p, we use d(x, y) to denote the geodesic distance, defined as

    d(x, y) := (1/π) ∠(x/‖x‖_2, y/‖y‖_2),

where ∠(·, ·) denotes the angle between two vectors. For x, y ∈ S^{p-1}, the metric d(x, y) is proportional to the length of the shortest path connecting x, y on the sphere. Given the success of JL embedding, a natural approach is to consider the one-bit quantization of a random projection:

    b = sign(Ax),    (2.1)

where A is some random projection matrix. Given two points x, y with embedding vectors b and c, we have b_i ≠ c_i if and only if ⟨A_i, x⟩⟨A_i, y⟩ < 0. The traditional metric in the embedded space has been the so-called normalized Hamming distance, which we denote by d_A(x, y) and define as follows:

    d_A(x, y) := (1/m) Σ_{i=1}^m 1{sign(⟨A_i, x⟩) ≠ sign(⟨A_i, y⟩)}.    (2.2)

Definition 2.1. (δ-Uniform Embedding) Given a set K ⊆ S^{p-1} and projection matrix A ∈ R^{m×p}, we say the embedding b = sign(Ax) provides a δ-uniform embedding for points in K if

    |d_A(x, y) − d(x, y)| ≤ δ,  ∀ x, y ∈ K.    (2.3)

Note that unlike for JL, we aim to control additive error instead of relative error. Due to the inherently limited resolution of binary embedding, controlling relative error would force the embedding dimension to scale inversely with the minimum distance of the original points, and in particular would be impossible for any infinite set.

2.2 Uniform Random Projection

Algorithm 1 Uniform Random Projection
input: A finite set of points K = {x_i}_{i=1}^{|K|} where K ⊆ S^{p-1}; embedding target dimension m.
1: Construct matrix A ∈ R^{m×p} where each entry A_{i,j} is drawn independently from N(0, 1).
2: for i = 1, 2, ..., |K| do
3:   b_i ← sign(A x_i).
4: end for
output: {b_i}_{i=1}^{|K|}

Algorithm 1 presents (2.1) formally, when A is an i.i.d. Gaussian random matrix, i.e., A_i ∼ N(0, I_p) for any i ∈ [m]. It is easy to observe that for two fixed points x, y ∈ S^{p-1} we have

    E[1{sign(⟨A_i, x⟩) ≠ sign(⟨A_i, y⟩)}] = d(x, y),  ∀ i ∈ [m].    (2.4)
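Equality (2.4) is easy to verify numerically. The sketch below (a hedged illustration with our own variable names) runs the steps of Algorithm 1 on two points and compares the normalized Hamming distance (2.2) with the geodesic distance:

```python
import numpy as np

# Empirical check of (2.4): for Gaussian rows A_i, the probability that
# sign(<A_i, x>) != sign(<A_i, y>) equals d(x, y), so the empirical
# fraction d_A(x, y) concentrates around d(x, y). Names are ours.

rng = np.random.default_rng(1)
p, m = 64, 20000
A = rng.standard_normal((m, p))           # step 1 of Algorithm 1

x = rng.standard_normal(p); x /= np.linalg.norm(x)
y = rng.standard_normal(p); y /= np.linalg.norm(y)

b = np.sign(A @ x)                         # steps 2-4
c = np.sign(A @ y)

d_A  = np.mean(b != c)                     # normalized Hamming distance (2.2)
d_xy = np.arccos(np.clip(x @ y, -1.0, 1.0)) / np.pi  # geodesic distance

# By Hoeffding's inequality, |d_A - d(x, y)| = O(1/sqrt(m)) with high probability.
```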

The above equality has a geometric explanation: each A_i represents a uniformly distributed random hyperplane in R^p. Then sign(⟨A_i, x⟩) ≠ sign(⟨A_i, y⟩) holds if and only if hyperplane A_i intersects the arc between x and y. In fact, d_A(x, y) is equal to the fraction of such hyperplanes. Under such a uniform tessellation, the probability with which the aforementioned event occurs is d(x, y). Applying Hoeffding's inequality and a probabilistic union bound over the N² pairs of points, we have the following straightforward guarantee.

Proposition 2.2. Given a set K ⊆ S^{p-1} with finite size |K|, consider Algorithm 1 with m ≥ c(1/δ²) log |K|. Then with probability at least 1 − 2 exp(−δ²m), we have

    |d_A(x, y) − d(x, y)| ≤ δ,  ∀ x, y ∈ K.

Here c is some absolute constant.

Proof. The proof idea is standard and follows from the above; we omit the details.

3 Main Results

We now present our main results on lower bounds, on fast binary embedding, and finally, on a general result for infinite sets.

3.1 Lower Bounds

We offer two different lower bounds. The first shows that any embedding technique that is oblivious to the input points must use m = Ω((1/δ²) log N) bits, regardless of what method is used to estimate geodesic distance from the embeddings. This shows that uniform random projection and our fast binary embedding achieve optimal bit complexity (up to constants). The bound follows from results by Jayram and Woodruff (2013) on the communication complexity of Hamming distance.

Theorem 3.1. Consider any distribution on embedding functions f : S^{p-1} → {0,1}^m and reconstruction algorithms g : {0,1}^m × {0,1}^m → R such that for any x_1, ..., x_N ∈ S^{p-1} we have

    |g(f(x_i), f(x_j)) − d(x_i, x_j)| ≤ δ

for all i, j ∈ [N] with probability 1 − ɛ. Then m = Ω((1/δ²) log(N/ɛ)).

Proof. See Section 5.1 for the detailed proof.

One could imagine, however, that an embedding could use knowledge of the input point set to embed any specific set of points into a lower-dimensional space than is possible with an oblivious algorithm.
In the Johnson-Lindenstrauss setting, Alon (2003) showed that this is not possible beyond (possibly) a log(1/δ) factor. We show the analogous result for binary embeddings. Relative to Theorem 3.1, our second lower bound works for data-dependent embedding functions but loses a log(1/δ) factor and requires the reconstruction function to depend only on the Hamming distance between the two strings. This restriction is natural because an unrestricted data-dependent reconstruction function could simply encode the answers and avoid any dependence on δ.

With the scheme given in (2.1), choosing A as a fully random Gaussian matrix yields d_A(x, y) ≈ d(x, y). However, an arbitrary binary embedding algorithm may not yield a linear functional relationship between Hamming distance and geodesic distance. Thus for this lower bound, we allow the design of an algorithm with an arbitrary link function L.

Definition 3.2. (Data-dependent binary embedding problem) Let L : [0, 1] → [0, 1] be a monotonic and continuous function. Given a set of points x_1, x_2, ..., x_N ∈ S^{p-1}, we say a binary embedding mapping f solves the binary embedding problem in terms of link function L if

    |d_H(f(x_i), f(x_j)) − L(d(x_i, x_j))| ≤ δ,  ∀ i, j ∈ [N].    (3.1)

Although the choice of L is flexible, note that for the same point we always have d_H(f(x_i), f(x_i)) = d(x_i, x_i) = 0, thus (3.1) implies L(0) < δ. We can just let L(0) = 0. In particular, we let L_max = L(1). We have the following lower bound:

Theorem 3.3. There exist 2N points x_1, x_2, ..., x_{2N} ∈ S^{N-1} such that for any binary embedding algorithm f on {x_i}_{i=1}^{2N}, if it solves the data-dependent binary embedding problem defined in Definition 3.2 in terms of link function L and any δ ∈ (0, L_max/(16e)), it must satisfy

    m ≥ (1/(128e)) (L_max/δ)² log N / log(L_max/(2δ)).    (3.2)

Proof. See Section 5.2 for the detailed proof.

Remark 3.4. We make two remarks on the above result. (1) When L_max is some constant, our result implies that for general N points, any binary embedding algorithm (even a data-dependent one) must have m = Ω(log N / (δ² log(1/δ))) measurements. This is analogous to Alon's lower bound in the JL setting. It is worth highlighting two differences: (i) The JL setting considers the same metric (Euclidean distance) for both the input and the embedded spaces. In binary embedding, however, we are interested in showing the relationship between Hamming distance and geodesic distance. (ii) Our lower bound is applicable to a broader class of binary embedding algorithms as it involves an arbitrary, even data-dependent, link function L. Such an extension is not considered in the lower bound for JL.
(2) The stated lower bound only depends on L_max and does not depend on any curvature information of L. The constraint L_max > 16eδ is critical for our lower bound to hold, but some such restriction is necessary because for L_max < δ we are able to embed all points into just one bit. In this case d_H(f(x_i), f(x_j)) = 0 for all pairs and condition (3.1) would hold trivially.

3.2 Fast Binary Embedding

In this section, we present a novel fast binary embedding algorithm and then establish its theoretical guarantees. There are two key ideas that we leverage: (i) instead of the normalized Hamming distance, we use a related metric, the median of the normalized Hamming distance applied to sub-blocks; and (ii) we prove a key pair-wise independence lemma for partial Gaussian Toeplitz projections, which allows us to use a concentration bound that then implies nearness in the median metric we use.

3.2.1 Method

Our algorithm builds on a sub-sampled Walsh-Hadamard matrix and partial Gaussian Toeplitz matrices with random column flips. In particular, an n-by-p partial Walsh-Hadamard matrix has the form

    Φ := P H D.    (3.3)

The above construction has three components, which we characterize as follows:

Term D is a p-by-p diagonal matrix with diagonal terms {ζ_i}_{i=1}^p drawn from an i.i.d. Rademacher sequence, i.e., for any i ∈ [p], Pr(ζ_i = 1) = Pr(ζ_i = −1) = 1/2.

Term H is a p-by-p scaled Walsh-Hadamard matrix such that H^T H = I_p.

Term P is an n-by-p sparse matrix where one entry of each row is set to 1 while the rest are 0. The nonzero coordinate of each row is drawn independently from the uniform distribution. In effect, the role of P is to randomly select n of the p rows of HD.

An m-by-n partial Gaussian Toeplitz matrix has the form

    Ψ := P T D.    (3.4)

We introduce each term as follows:

Term D is an n-by-n diagonal matrix with diagonal terms {ζ_i}_{i=1}^n drawn from an i.i.d. Rademacher sequence.

Term T is an n-by-n Toeplitz matrix constructed from a (2n−1)-dimensional vector g such that T_{i,j} = g_{i−j+n} for any i, j ∈ [n]. In particular, g is drawn from N(0, I_{2n−1}).

Term P is an m-by-n sparse matrix where P_i = e_i for any i ∈ [m]. Equivalently, we use P to select the first m rows of TD. It is worth noting that we actually only need to select any m distinct rows.

With the above constructions in hand, we present our fast algorithm in Algorithm 2. At a high level, Algorithm 2 consists of two parts: First, we apply a column-flipped partial Hadamard transform to convert each p-dimensional point into an n-dimensional intermediate point. Second, we use B independent (m/B)-by-n partial Gaussian Toeplitz matrices and the sign operator to map an intermediate point into B blocks of binary codes. In terms of similarity computation for the embedded codes, we use the median of each block's normalized Hamming distance. In detail, for b, c ∈ {0,1}^m, the B-wise normalized Hamming distance is defined as

    d_H(b, c; B) := median({d_H(b_{T_i}, c_{T_i})}_{i=0}^{B−1}),    (3.5)

where T_i = [im/B + 1, (i+1)m/B].
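The constructions (3.3)-(3.5) can be sketched in a few lines. The code below uses dense matrices for clarity (the point of the construction is that both factors admit fast multiplies, which this sketch does not exploit); it assumes p is a power of two, and the √(p/n) rescaling and all names are our choices rather than the paper's:

```python
import numpy as np
from scipy.linalg import hadamard, toeplitz

# Dense sketches of Phi = P H D (3.3), Psi = P T D (3.4), and the
# B-wise median Hamming distance (3.5). Illustrative only.

rng = np.random.default_rng(2)

def partial_hadamard(n, p):
    """Phi of (3.3): random column flips, scaled Hadamard, uniform row sampling."""
    D = np.diag(rng.choice([-1.0, 1.0], size=p))   # Rademacher flips
    H = hadamard(p) / np.sqrt(p)                   # scaled so H^T H = I_p
    rows = rng.integers(0, p, size=n)              # uniform nonzero coordinates of P
    return (H @ D)[rows] * np.sqrt(p / n)          # rescaling is our choice

def partial_gaussian_toeplitz(m1, n):
    """Psi of (3.4): first m1 rows of a Gaussian Toeplitz matrix with flips."""
    D = np.diag(rng.choice([-1.0, 1.0], size=n))
    g = rng.standard_normal(2 * n - 1)
    T = toeplitz(g[n - 1:], g[n - 1::-1])          # T[i, j] = g[i - j + n - 1]
    return (T @ D)[:m1]

def median_hamming(b, c, B):
    """d_H(b, c; B) of (3.5): median of per-block normalized Hamming distances."""
    blocks = zip(np.array_split(b, B), np.array_split(c, B))
    return np.median([np.mean(u != v) for u, v in blocks])

# Example shapes: an 8 x 16 Phi and a 4 x 8 Psi.
Phi = partial_hadamard(8, 16)
Psi = partial_gaussian_toeplitz(4, 8)
```

A single corrupted block changes only one of the B per-block distances, which is why the median estimator tolerates the dependence among rows within a Toeplitz block.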
It is worth noting that our first step is one construction of a fast JL transform. In fact any fast JL transform would work for our construction, but we choose a standard one with real values: based on

Rudelson and Vershynin (2008); Cheraghchi et al. (2013); Krahmer and Ward (2011), it is known that with n = O(ɛ⁻² log N log p log³(log N)) measurements, a subsampled Hadamard matrix with column flips becomes an ɛ-JL matrix for N points. The second part of our algorithm follows framework (2.1). By choosing a Gaussian random vector in each row of Ψ, from our previous discussion in Section 2.2, the probability that such a hyperplane intersects the arc between two points is equal to their geodesic distance. Compared to a fully random Gaussian matrix, as used in Algorithm 1, the key difference is that the hyperplanes represented by the rows of Ψ are not independent of each other; this imposes the main analytical challenge.

Algorithm 2 Fast Binary Embedding
input: A finite set of points {x_i}_{i=1}^N where each point x_i ∈ S^{p-1}; embedded dimension m; intermediate dimension n; number of blocks B.
1: Draw an n-by-p sub-sampled Walsh-Hadamard matrix Φ according to (3.3). Draw B independent partial Gaussian Toeplitz matrices {Ψ^{(j)}}_{j=1}^B with size (m/B)-by-n according to (3.4).
2: {Part I: Fast JL}
3: for i = 1, 2, ..., N do
4:   y_i ← Φ x_i.
5: end for
6: {Part II: Partial Gaussian Toeplitz Projection}
7: for i = 1, 2, ..., N do
8:   for j = 1, 2, ..., B do
9:     c_j ← sign(Ψ^{(j)} y_i).
10:  end for
11:  b_i ← [c_1; c_2; ...; c_B]
12: end for
output: {b_i}_{i=1}^N

3.2.2 Analysis

We give the analysis for Algorithm 2. We first review a well known result about fast JL transforms.

Lemma 3.5. Consider the column-flipped partial Hadamard matrix defined in (3.3) with size n-by-p. For N points x_1, x_2, ..., x_N ∈ S^{p-1}, let y_i = √(p/n) Φ x_i, i ∈ [N]. For some absolute constant c, suppose n ≥ cδ⁻² log N log p log³(log N); then with probability at least 0.99, we have that for any i, j ∈ [N]

    | ‖y_i − y_j‖_2 − ‖x_i − x_j‖_2 | ≤ δ ‖x_i − x_j‖_2,    (3.6)

and for any i ∈ [N]

    | ‖y_i‖_2 − 1 | ≤ δ.    (3.7)

Proof. It can be proved by combining Theorem 4 in Cheraghchi et al. (2013) and Theorem 3.1 in Krahmer and Ward (2011).

The above result shows that the first part of our algorithm reduces the dimension while preserving well the Euclidean distance of each pair. Under this condition, all the pairwise geodesic distances are also well preserved, as confirmed by the following result.

Lemma 3.6. Consider the set of embedded points {y_i}_{i=1}^N defined in Lemma 3.5. Suppose conditions (3.6)-(3.7) hold with δ > 0. Then for any i, j ∈ [N],

    |d(y_i, y_j) − d(x_i, x_j)| ≤ Cδ    (3.8)

holds with some absolute constant C.

Proof. We postpone the proof to Appendix A.

The next result is our independence lemma, and is one of the key technical ideas that make our result possible. The result shows that for any fixed x, Gaussian Toeplitz projection (with column flips) plus sign(·) generates pair-wise independent binary codes.

Lemma 3.7. Let g ∼ N(0, I_{2n−1}), and let ζ = {ζ_i}_{i=1}^n be an i.i.d. Rademacher sequence. Let T be a random Toeplitz matrix constructed from g such that T_{i,j} = g_{i−j+n}. Consider any two distinct rows of T, say ξ, ξ′. For any two fixed vectors x, y ∈ R^n, we define the following random variables:

    X = sign(⟨ξ ⊙ ζ, x⟩), X′ = sign(⟨ξ′ ⊙ ζ, x⟩); Y = sign(⟨ξ ⊙ ζ, y⟩), Y′ = sign(⟨ξ′ ⊙ ζ, y⟩).

We have X ⊥ X′, X ⊥ Y′, Y ⊥ X′, Y ⊥ Y′.

Proof. See Section 5.3.1 for the detailed proof.

We are ready to prove the following result about Algorithm 2.

Theorem 3.8. Consider Algorithm 2 with random matrices Φ, Ψ defined in (3.3) and (3.4) respectively. For a finite set of points {x_i}_{i=1}^N, let b_i be the binary code of x_i generated by Algorithm 2. Suppose we set

    B ≥ c log N,  n ≥ c′(1/δ²) log N log p log³(log N),  m/B ≥ c″(1/δ²),

with some absolute constants c, c′, c″; then with probability at least 0.98, we have that for any i, j ∈ [N]

    |d_H(b_i, b_j; B) − d(x_i, x_j)| ≤ δ.

The similarity metric d_H(·, ·; B) is the median of normalized Hamming distances defined in (3.5).

Proof. See Section 5.3.2 for the detailed proof.
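The speed of the Toeplitz stage rests on FFT-based matrix-vector multiplication: an n-by-n Toeplitz matrix embeds into a circulant of size 2n−1, so T x costs O(n log n) instead of O(n²). A minimal sketch of this standard trick (our own illustration, not the paper's code):

```python
import numpy as np

# Standard circulant-embedding trick for fast Toeplitz multiplication.
# This is the kind of fast multiply the partial Toeplitz stage assumes.

def toeplitz_matvec(g, x):
    """Multiply the Toeplitz matrix T[i, j] = g[i - j + n - 1] by x via FFT."""
    n = len(x)
    assert len(g) == 2 * n - 1
    # First column of the (2n-1)-circulant that contains T as its top-left block.
    col = np.concatenate([g[n - 1:], g[:n - 1]])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x, 2 * n - 1)).real
    return y[:n]

# Check against the dense multiply.
rng = np.random.default_rng(3)
n = 16
g = rng.standard_normal(2 * n - 1)
x = rng.standard_normal(n)
T = np.array([[g[i - j + n - 1] for j in range(n)] for i in range(n)])
assert np.allclose(toeplitz_matvec(g, x), T @ x)
```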

The above result shows that the measurement complexity of our fast algorithm is m = O(δ⁻² log N), which matches the performance of Algorithm 1 based on a fully random matrix. Note that this measurement complexity cannot be improved significantly by any data-oblivious binary embedding with any similarity metric, as shown by Theorem 3.1.

Running time: The first part of our algorithm takes time O(p log p). Generating a single block of binary codes from a partial Toeplitz matrix takes time O(n log(1/δ)).¹ Thus the total running time is O(Bn log(1/δ) + p log p) = O(δ⁻² log(1/δ) log² N · log p · log³(log N) + p log p). Ignoring the poly-log-log factor, the second term O(p log p) dominates when log N ≲ δ √(p / log(1/δ)).

Comparison to an alternative algorithm: Instead of utilizing the partial Gaussian Toeplitz projection, an alternative method, to the best of our knowledge not previously stated, is to use a fully random Gaussian projection in the second part of our algorithm. We present the details in Algorithm 3. By combining Proposition 2.2 and Lemma 3.5, it is straightforward to show that this algorithm still achieves the same measurement complexity m = O(δ⁻² log N). The corresponding running time is O(δ⁻⁴ log² N · log p · log³(log N) + p log p), so it is fast when log N ≲ δ² √p. Therefore our algorithm has an improved dependence on δ. This improvement comes from the fast multiplication of a partial Toeplitz matrix and the pair-wise independence argument shown in Lemma 3.7.

Algorithm 3 Alternative Fast Binary Embedding
input: Finite number of points {x_i}_{i=1}^N where each point x_i ∈ S^{p-1}, embedded dimension m, intermediate dimension n.
1: Draw an n-by-p sub-sampled Walsh-Hadamard matrix Φ according to (3.3). Construct an m-by-n matrix A where each entry is drawn independently from N(0, 1).
2: for i = 1, 2, ..., N do
3:   b_i ← sign(AΦx_i)
4: end for
output: {b_i}_{i=1}^N

3.3 δ-uniform Embedding for General K

In this section, we turn back to the fully random projection binary embedding (Algorithm 1).
Recall that in Proposition 2.2, we showed that for finite-size K, m = O(δ⁻² log |K|) measurements are sufficient to achieve δ-uniform embedding. For general K, the challenge is that there might be an infinite number of distinct points in K, so Proposition 2.2 cannot be applied. In proving the JL lemma for an infinite set K, the standard technique is either constructing an ɛ-net of K or reducing the distortion to the deviation bound of a Gaussian process. However, due to the non-linearity essential for binary embedding, these techniques cannot be directly extended to our setting. Therefore, strengthening Proposition 2.2 to infinite-size K imposes significant technical challenges. Before stating our result, we first give some definitions.

Definition 3.9 (Gaussian mean width). Let g ∼ N(0, I_p). For any set K ⊆ S^{p-1}, the Gaussian
¹Matrix-vector multiplication for an m-by-n partial Toeplitz matrix can be implemented in running time O(n log n).

ean width of K is defined as w(k) := E g sup g, x. x K Here, w(k) 2 easures the effective diension of set K. In the trivial case K = S p, we have w(k) 2 p. However, when K has soe special structure, we ay have w(k) 2 p. For instance, when K = {x S p : supp(x) s}, it has been shown that w(k) = Θ( s log(p/s)) (see Lea 2.3 in Plan and Vershynin (203)). For a given δ, we define K + δ, the expanded version of K Sp as: K + δ := K { z S p : z = x y x y 2, x, y K if δ 2 x y 2 δ }. (3.9) In other words, K + δ is constructed fro K by adding the noralized differences between pairs of points in K that are within δ but not closer than δ 2. Now we state the ain result as follows. Theore 3.0. Consider any K S p. Let A R p be an i.i.d. Gaussian atrix where each row A i N (0, I p ). For any two points x, y K, d A (x, y) is defined in (2.2). Expanded set K + δ is defined in (3.9). When c w(k+ δ )2 δ 4, with soe absolute constant c, then we have that sup x,y K d A (x, y) d(x, y) δ holds with probability at least c exp( c 2 δ 2 ) where c, c 2 are absolute constants. Proof. See Section 5.4 for detailed proof. Reark 3.. We copare the above result to Theore.5 fro the recent paper Plan and Vershynin (204) where it is proved that for w(k) 2 /δ 6, Algorith is guaranteed to achieve δ-unifor ebedding for general K. Based on definition (3.9), we have w(k) w(k + δ ) δ 2 w(k K) δ 2 w(k). Thus in the worst case, Theore 3.0 recovers the previous result up to a factor. More iportantly, for any interesting sets one can show w(k + δ δ 2 ) w(k); in such cases, our result leads to an iproved dependence on δ. We give several such exaples as follows: Low rank set. For soe U R p r such that U U = I r, let K = {x S p : x = Uc, c S r }. We siply have K = K + δ and w(k) r. Our result iplies = O ( r/δ 4). Sparse set. K = {x S p : supp(x) s}. In this case we have K + δ {x S p : supp(x) 2s}. Therefore w(k + δ ) = Θ( s log(p/s)). Our result iplies = O ( s log(p/s) δ 4 ). Set with finite size. K <. 
As w(k) log K and K + δ 2 K, our result iplies = O ( log K /δ 4). We thus recover Proposition 2.2 up to factor /δ 2. Applying the result fro Plan and Vershynin (204) to the above sets iplies siilar results but the dependence on δ becoes /δ 6. 2

4 Numerical Results

In this section, we present the results of experiments we conduct to validate our theory and compare the performance of the three algorithms we discussed: uniform random projection (URP) (Algorithm 1), fast binary embedding (FBE) (Algorithm 2), and the alternative fast binary embedding (FBE-2) (Algorithm 3). We first apply these algorithms to synthetic datasets. In detail, given parameters (N, p), a synthetic dataset is constructed by sampling N points from S^{p-1} uniformly at random. Recall that δ is the maximum embedding distortion among all pairs of points. We use m to denote the number of binary measurements. Algorithm FBE needs parameters n, B, which are the intermediate dimension and the number of blocks respectively. Based on Theorem 3.8, n is required to be proportional to m (up to some logarithmic factors) and B is required to be proportional to log N. We thus set n = 1.3m, B = 1.8 log N. We also set n = 1.3m for FBE-2. In addition, we fix p = 512. We report our first result, showing the functional relationship between (m, N, δ), in Figure 1. In particular, panel (a) shows the change of distortion δ over the number of measurements m for fixed N. We observe that, for all three algorithms, δ decays with m at the rate predicted by Proposition 2.2 and Theorem 3.8. Panel (b) shows the empirical relationship between m and log N for fixed δ. As predicted by our theory (lower bound and upper bound), m has a linear dependence on log N.

[Figure 1: two panels plotting (a) distortion δ against m for N = 300 and (b) m against log N at δ = 0.3, for FBE, FBE-2, and URP.]
Figure 1: Results on synthetic datasets. (a) Each point, along with the standard deviation represented by the error bar, is an average of 50 trials, each of which is based on a fresh synthetic dataset of size N = 300 and a newly constructed embedding mapping. (b) Each point is computed by slicing at δ = 0.3 in plots like (a) with the corresponding N.
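The distortion δ reported in panel (a) is, for a set of points and their codes, the largest pairwise gap between the normalized Hamming distance and the geodesic distance. A small helper of the kind used to produce such plots (plain Hamming distance, as for URP; the dataset and measurement counts here are placeholders, not the paper's settings):

```python
import numpy as np

def max_distortion(X, codes):
    """Empirical delta of Section 4: the largest pairwise gap between the
    normalized Hamming distance of the codes and the geodesic distance of
    the unit vectors in X (plain Hamming, as used for URP)."""
    N = len(X)
    geo = np.arccos(np.clip(X @ X.T, -1.0, 1.0)) / np.pi
    ham = np.array([[np.mean(codes[i] != codes[j]) for j in range(N)]
                    for i in range(N)])
    iu = np.triu_indices(N, k=1)
    return float(np.max(np.abs(ham - geo)[iu]))
```

For a URP embedding codes = sign(XAᵀ) with a large m, the returned δ is small, matching the decay seen in panel (a).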
A popular application of binary embedding is image retrieval, as considered in (Gong and Lazebnik, 2011; Gong et al., 2013; Yu et al., 2014). We thus conduct an experiment on the Flickr-25600 dataset, which consists of 10k images from the Internet. Each image is represented by a 25600-dimensional normalized Fisher vector. We take 500 randomly sampled images as query points and leave the rest as the base for retrieval. The relevant images of each query are defined as its 10 nearest neighbors

[Figure 2: three panels, (a) m = 5000, (b) m = 10000, (c) m = 15000, plotting recall against the number of retrieved images (20 to 100) for FBE, FBE-2, FBE-2 (same time), URP, and URP (same time).]
Figure 2: Image retrieval results on Flickr-25600. Each panel presents the recall for the specified number of measurements. Black and blue dotted lines are respectively the recall of FBE-2 and URP with fewer measurements but the same running time as FBE.

based on geodesic distance. Given m, we apply FBE, FBE-2 and URP to convert all images into m-dimensional binary codes. In particular, we set B = 10 for FBE and n = 1.3m for FBE and FBE-2. Then we leverage the corresponding similarity metrics, (3.5) for FBE and the Hamming distance for FBE-2 and URP, to retrieve the nearest images for each query. The performance of each algorithm is characterized by recall, i.e., the number of retrieved relevant images divided by the total number of relevant images. We report our second result in Figure 2. Each panel shows the average recall of all queries for a specified m. We note that FBE-2, as a fast algorithm, performs as well as URP with the same number of measurements. In order to show the running-time advantage of our fast algorithm FBE, we also present the performance of FBE-2 and URP with fewer measurements, such that they can be computed in the same time as FBE. As we observe, with a large number of measurements FBE-2 and URP perform marginally better than FBE, while FBE has a significant improvement over the two algorithms under identical time constraints.

5 Proofs

5.1 Proof of Data-Oblivious Lower Bound (Theorem 3.1)

The proof of the data-oblivious lower bound is based on a lower bound for one-way communication of Hamming distance due to Jayram and Woodruff (2013).

Definition 5.1 (One-way communication of Hamming distance). In the one-way communication model, Alice is given a ∈ {0,1}^n and Bob is given b ∈ {0,1}^n.
Alice sends Bob a message c ∈ {0,1}^m, and Bob uses b and c to output a value x ∈ R. Alice and Bob have shared randomness. Alice and Bob solve the (δ, ɛ) additive Hamming distance estimation problem if |x − d_H(a, b)| ≤ δ with probability 1 − ɛ.

The result proven in Jayram and Woodruff (2013) is a lower bound for the multiplicative Hamming distance estimation problem, but their techniques readily yield a bound for the additive case

as well:

Lemma 5.2. Any algorithm that solves the (δ, ɛ) additive Hamming distance estimation problem must have m = Ω((1/δ²) log(1/ɛ)), as long as this is less than n.
Proof. We apply Lemma 3.1 of Jayram and Woodruff (2013) with parameters α = 2, p = 1, b = 1, ε = δ, and δ = ɛ. This encodes inputs from a problem they prove is hard (augmented indexing on large domains) to inputs appropriate for Hamming estimation. In particular, for n′ = O(δ⁻² log(1/ɛ)) it gives a distribution on (a, b) ∈ {0,1}^{n′} × {0,1}^{n′} that is divided into NO and YES instances, such that:
- From the reduction, distinguishing NO instances from YES instances with probability 1 − ɛ requires Alice to send m = Ω(δ⁻² log(1/ɛ)) bits of communication to Bob.
- In NO instances, d_H(a, b) ≥ (1/2)(1 − δ/3).
- In YES instances, d_H(a, b) ≤ (1/2)(1 − 2δ/3).
First, suppose n = n′. Then since solving the additive Hamming distance estimation problem with δ/12 accuracy would distinguish NO instances from YES instances, it must involve m = Ω(δ⁻² log(1/ɛ)) bits of communication.
For n > n′, simply duplicate the coordinates of a and b ⌊n/n′⌋ times, and zero-pad the remainder. Less than half the coordinates are then part of the zero-padding, so the gap between YES and NO instances remains at least δ/12, and a protocol for the (δ/24, ɛ) additive Hamming distance estimation problem requires m = Ω(δ⁻² log(1/ɛ)), as desired.

With this in hand, we can prove Theorem 3.1:

Proof of Theorem 3.1. We reduce one-way communication of the (δ, ɛ) additive Hamming distance estimation problem to the embedding problem. Let a, b ∈ {0,1}^p be drawn from the hard instance for the communication problem defined in Lemma 5.2. Linearly transform them to u, v ∈ S^{p-1} via u = (2a − 1)/√p, v = (2b − 1)/√p. We have ⟨u, v⟩ = 1 − 2 d_H(a, b), so
d(u, v) = arccos(⟨u, v⟩)/π = arccos(1 − 2 d_H(a, b))/π, or d_H(a, b) = (1 − cos(π d(u, v)))/2.
Given an estimate of d(u, v), we can therefore get an estimate of d_H(a, b). In particular, since |cos′(x)| ≤ 1, if we learn d(u, v) to ±δ then we learn d_H(a, b) to ±δπ/2. For now, consider the case of N = 2.
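The transformation in the proof is elementary to check numerically: mapping a, b ∈ {0,1}^p to u = (2a − 1)/√p and v = (2b − 1)/√p gives ⟨u, v⟩ = 1 − 2 d_H(a, b), and the two displayed formulas invert each other. A sketch:

```python
import numpy as np

def embed_pm1(a):
    """Map a binary string a in {0,1}^p to u = (2a - 1)/sqrt(p) on S^{p-1}."""
    a = np.asarray(a, dtype=float)
    return (2 * a - 1) / np.sqrt(len(a))

def geodesic(u, v):
    """Normalized geodesic distance d(u, v) = arccos(<u, v>)/pi."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)) / np.pi)

def hamming_from_geodesic(d):
    """Invert d(u, v) = arccos(1 - 2 d_H(a, b))/pi for the normalized d_H."""
    return (1 - np.cos(np.pi * d)) / 2
```

For example, strings disagreeing in half their coordinates map to orthogonal unit vectors, whose geodesic distance 1/2 decodes back to a normalized Hamming distance of 1/2.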
Consider an oblivious embedding function f : S^{p-1} → {0,1}^m and a reconstruction algorithm g : {0,1}^m × {0,1}^m → R such that
| g(f(u), f(v)) − d(u, v) | ≤ 2δ/π
with probability 1 − ɛ on the distribution of inputs (u, v). We can solve the one-way communication problem for Hamming distance estimation by Alice sending f(u) to Bob, Bob learning d(u, v) ≈

g(f(u), f(v)), and then computing d_H(a, b) to ±δ. By the lower bound for this problem, any such f and g must have m = Ω(δ⁻² log(1/ɛ)), proving the result for N = 2 (after rescaling δ).
For general N, we draw instances (u₁, v₁), (u₂, v₂), ..., (u_{N/2}, v_{N/2}) independently from the hard instance for binary embedding with N = 2 and ɛ′ = 4ɛ/N. Consider an oblivious embedding function f : S^{p-1} → {0,1}^m and a reconstruction algorithm g : {0,1}^m × {0,1}^m → R that has, for all i ∈ [N/2],
| g(f(u_i), f(v_i)) − d(u_i, v_i) | ≤ δ
with probability 1 − ɛ on this distribution. Define α to be the probability that |g(f(u_i), f(v_i)) − d(u_i, v_i)| ≤ δ for any particular i. Because f and g are oblivious and the different instances are independent, the probability that all instances succeed is α^{N/2} ≥ 1 − ɛ, so α ≥ (1 − ɛ)^{2/N} ≥ 1 − 4ɛ/N. In particular, this means f and g solve the hard instance of binary embedding with N = 2 and ɛ′ = 4ɛ/N. By the above lower bound for N = 2, this means
m = Ω(δ⁻² log(N/ɛ))
as desired.

5.2 Proof of Data-Dependent Lower Bound (Theorem 3.3)

We need a few ingredients to show the lower bound. First, we define a matrix that is close to the identity matrix.

Definition 5.3 ((δ₁, δ₂)-near identity matrix). A symmetric matrix M ∈ R^{p×p} is called a (δ₁, δ₂)-near identity matrix if it satisfies both of the following conditions:
1 − δ₁ ≤ M_{i,i} ≤ 1, ∀i ∈ [p], and |M_{i,j}| ≤ δ₂, ∀i ≠ j ∈ [p].

Next we give a lower bound on the rank of a (δ₁, δ₂)-near identity matrix.

Lemma 5.4. Suppose the positive semidefinite matrix M ∈ R^{p×p} is a (δ₁, δ₂)-near identity matrix with rank d, and 0 < δ₁, δ₂ < 1. Then we have
d ≥ p(1 − δ₁)² / (1 + (p − 1)δ₂²).
Proof. We postpone the proof to Appendix B.

The above result is weak when applied directly to show our desired lower bound. We still need to make use of the following combinatorial result.

Lemma 5.5. Suppose matrix M ∈ R^{p×p} has rank d. Let P(x) be any degree-k polynomial function. Consider the matrix N ∈ R^{p×p} defined as N := P(M), where N_{i,j} = P(M_{i,j}). We have
rank(N) ≤ binom(k + d, k).
Proof. See Lemma 9.2 of Alon (2003) for a detailed proof.

Now we are ready to prove Theorem 3.3.

Proof of Theorem 3.3. Let e_i denote the i-th natural basis vector, i.e., the i-th coordinate is 1 while the rest are all zeros. Consider the N points {e₁, e₂, ..., e_N} and their opposite vectors {−e₁, −e₂, ..., −e_N}. For any binary embedding algorithm f, we let b_i := f(e_i), c_i := f(−e_i), i ∈ [N]. Under the condition that f solves the general binary embedding problem with link function L, we have
| d_H(b_i, c_i) − L(d(e_i, −e_i)) | ≤ δ, ∀i ∈ [N]. (5.1)
As d(e_i, −e_i) = 1, we have
L(1) + δ ≥ d_H(b_i, c_i) ≥ L(1) − δ. (5.2)
Similarly, note that d(e_i, e_j) = d(e_i, −e_j) = d(−e_i, −e_j) = 1/2, ∀i ≠ j; we have, ∀i ≠ j,
L(1/2) − δ ≤ d_H(b_i, b_j) ≤ L(1/2) + δ, (5.3)
L(1/2) − δ ≤ d_H(c_i, c_j) ≤ L(1/2) + δ, (5.4)
L(1/2) − δ ≤ d_H(b_i, c_j) ≤ L(1/2) + δ. (5.5)
From now on, we treat the binary strings b_i, c_i as vectors in {−1, 1}^m. Let B denote the matrix with rows b_i and C the matrix with rows c_i. Consider the outer product of the difference between B and C, namely M = (B − C)(B − C)ᵀ. Note that ∀i ∈ [N],
M_{i,i} = ‖b_i − c_i‖₂² = 4m · d_H(b_i, c_i) ≥ 4m(L(1) − δ).
The last inequality follows from (5.2). For i ≠ j, we have
M_{i,j} = ⟨b_i − c_i, b_j − c_j⟩ = ⟨b_i, b_j⟩ + ⟨c_i, c_j⟩ − ⟨b_i, c_j⟩ − ⟨b_j, c_i⟩
= 2m( d_H(b_i, c_j) + d_H(b_j, c_i) − d_H(b_i, b_j) − d_H(c_i, c_j) ),

where the third equality follows from
d_H(b, c) = (1/(4m)) ( ‖b‖₂² + ‖c‖₂² − 2⟨b, c⟩ ), ∀ b, c ∈ {−1, 1}^m.
By using (5.3) to (5.5), we have |M_{i,j}| ≤ 8mδ. Therefore, (1/(4m(L(1) + δ))) M is actually a (2δ/L(1), 2δ/L(1))-near identity matrix. Consider the degree-k polynomial P(z) = z^k, and let
N = P( (1/(4m(L(1) + δ))) M ).
It is easy to observe that N is a (γ₁, γ₂)-near identity matrix where
γ₁ = 1 − (1 − 2δ/L(1))^k, γ₂ = (2δ/L(1))^k.
Under the condition δ ≤ L(1)/4, we have 1 − γ₁ = (1 − 2δ/L(1))^k ≥ (1/2)^k. By setting k = 2 log N / log(L(1)/(2δ)), we have γ₂ ≤ 1/N.
We apply Lemma 5.4 by setting δ₁, δ₂, p in the statement to be γ₁, γ₂, N respectively. We get
rank(N) ≥ N(1/4)^k / (1 + (N − 1)/N²) ≥ (1/2)(1/4)^k N ≥ (1/8)^k N. (5.6)
On the other hand, (1/(4m(L(1) + δ))) M has rank at most m. By applying Lemma 5.5 we get
rank(N) ≤ binom(m + k, k) ≤ ( e(m + k)/k )^k.
Applying the above result and (5.6) directly yields
(1/8) N^{1/k} ≤ e(m + k)/k.
When k = 2 log N / log(L(1)/(2δ)) as we set, N^{1/k} = (L(1)/(2δ))^{1/2}. Therefore we have
m ≥ (k/(8e)) (L(1)/(2δ))^{1/2} − k ≥ (k/(16e)) (L(1)/(2δ))^{1/2} = (1/(8e)) (L(1)/(2δ))^{1/2} · log N / log(L(1)/(2δ)),
where the second inequality holds when (L(1)/(2δ))^{1/2} ≥ 16e.
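Lemma 5.5, the combinatorial ingredient of the proof above, can be illustrated numerically: taking an entrywise power of a low-rank Gram matrix keeps the rank below binom(k + d, k). A sketch (dimensions are arbitrary):

```python
import numpy as np
from math import comb

def entrywise_poly_rank(p, d, k, rng):
    """Numerical illustration of Lemma 5.5: if rank(M) = d and N = P(M) with
    P(z) = z^k applied entrywise, then rank(N) <= binom(k + d, k)."""
    U = rng.standard_normal((p, d))
    M = U @ U.T            # rank d (almost surely)
    N = M ** k             # entrywise degree-k polynomial
    return int(np.linalg.matrix_rank(N)), comb(k + d, k)
```

For instance, with p = 30, d = 2, and k = 3, the entrywise cube stays far below full rank even though N is a 30-by-30 matrix.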

5.3 Proofs about the Fast Binary Embedding Algorithm

5.3.1 Proof of Lemma 3.7

Proof. It suffices to prove X ⊥ Y′; one can check similarly that the proof holds for the remaining three claims. Note that X, Y′ are binary random variables with values in {−1, 1}. It is easy to observe that both of them are balanced, namely Pr(X = 1) = Pr(Y′ = 1) = 1/2. If X ⊥ Y′, then we have Pr(X = Y′) = 1/2. In the reverse direction, suppose Pr(X = Y′) = 1/2. First we have
Pr(X = 1) = Pr(X = 1, Y′ = 1) + Pr(X = 1, Y′ = −1) = 1/2, (5.7)
Pr(Y′ = −1) = Pr(X = 1, Y′ = −1) + Pr(X = −1, Y′ = −1) = 1/2. (5.8)
Combining the above two results, we have Pr(X = 1, Y′ = 1) = Pr(X = −1, Y′ = −1). Using
Pr(X = 1, Y′ = 1) + Pr(X = −1, Y′ = −1) = Pr(X = Y′) = 1/2,
we thus have Pr(X = 1, Y′ = 1) = Pr(X = −1, Y′ = −1) = 1/4. Plugging this into (5.7) and (5.8), we have Pr(X = 1, Y′ = −1) = Pr(X = −1, Y′ = 1) = 1/4. Thus we have shown
Pr(X = v | Y′ = u) = Pr(X = v, Y′ = u) / Pr(Y′ = u) = Pr(X = v), ∀u, v ∈ {−1, 1},
which leads to X ⊥ Y′. The above arguments show that X ⊥ Y′ if and only if Pr(X = Y′) = 1/2. Recalling the definitions of X and Y′, this condition holds if and only if
Pr( Z ≥ 0 ) = 1/2, where Z := ⟨ξ ∘ ζ, x⟩ · ⟨ξ′ ∘ ζ, y⟩.
Next we prove that Z has a symmetric distribution around 0. Let I = [1, n], I′ = [1, n − l], I₀ = [2n − l, 2n − 1] for some natural number l < n. Without loss of generality, we assume ξ = g_I and ξ′ = [g_{I₀}; g_{I′}]. We split I into T = ⌈n/l⌉ consecutive disjoint subsets I₁, I₂, ..., I_T, each of which has size l except |I_T| = n − (T − 1)l. Also, let I′_T contain the first n − (T − 1)l entries of I_{T−1}. Then we have
Z = ( Σ_{i=1}^T ⟨g_{I_i} ∘ ζ_{I_i}, x_{I_i}⟩ ) · ( Σ_{i=1}^{T−2} ⟨g_{I_i} ∘ ζ_{I_{i+1}}, y_{I_{i+1}}⟩ + ⟨g_{I′_T} ∘ ζ_{I_T}, y_{I_T}⟩ + ⟨g_{I₀} ∘ ζ_{I₁}, y_{I₁}⟩ ). (5.9)
We now let ĝ be the random vector that is identical to g except that, for any i ∈ {0} ∪ [T],
ĝ_{I_i} = −g_{I_i} if i mod 2 = 0.
Let ζ̂ be the random vector that is identical to ζ except that, for any i ∈ [T],
ζ̂_{I_i} = −ζ_{I_i} if i mod 2 = 1.
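The conclusion Pr(X = Y′) = 1/2 is easy to probe by simulation: draw fresh (g, ζ), form two distinct Toeplitz rows, and compare the resulting signs for two fixed vectors. A sketch (row offset fixed to l = 1; indices follow T_{i,j} = g_{i−j+n} in 0-based form):

```python
import numpy as np

def toeplitz_sign_pair(x, y, n, trials, rng):
    """Monte Carlo probe of Lemma 3.7: for two distinct Toeplitz rows
    (offset 1 here) with shared Rademacher column flips zeta, the signs
    X = sign(<xi o zeta, x>) and Y' = sign(<xi' o zeta, y>) agree with
    frequency close to 1/2."""
    agree = 0
    for _ in range(trials):
        g = rng.standard_normal(2 * n - 1)
        zeta = rng.choice([-1.0, 1.0], size=n)
        row1 = np.array([g[0 - j + n - 1] for j in range(n)])  # row i = 1
        row2 = np.array([g[1 - j + n - 1] for j in range(n)])  # row i = 2
        X = np.sign(np.dot(row1 * zeta, x))
        Yp = np.sign(np.dot(row2 * zeta, y))
        agree += int(X == Yp)
    return agree / trials
```

For any fixed pair of vectors, the empirical agreement frequency hovers around 1/2, consistent with the pairwise independence the lemma asserts.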

Replacing g, ζ in (5.9) with ĝ, ζ̂ yields
Ẑ = ( Σ_{i=1}^T ⟨ĝ_{I_i} ∘ ζ̂_{I_i}, x_{I_i}⟩ ) · ( Σ_{i=1}^{T−2} ⟨ĝ_{I_i} ∘ ζ̂_{I_{i+1}}, y_{I_{i+1}}⟩ + ⟨ĝ_{I′_T} ∘ ζ̂_{I_T}, y_{I_T}⟩ + ⟨ĝ_{I₀} ∘ ζ̂_{I₁}, y_{I₁}⟩ )
= −( Σ_{i=1}^T ⟨g_{I_i} ∘ ζ_{I_i}, x_{I_i}⟩ ) · ( Σ_{i=1}^{T−2} ⟨g_{I_i} ∘ ζ_{I_{i+1}}, y_{I_{i+1}}⟩ + ⟨g_{I′_T} ∘ ζ_{I_T}, y_{I_T}⟩ + ⟨g_{I₀} ∘ ζ_{I₁}, y_{I₁}⟩ )
= −Z.
As each entry of g is a symmetric random variable around 0, ĝ and g have the same probability distribution. The same fact holds for ζ̂ and ζ. So we conclude that Z has a symmetric distribution around 0, which implies Pr(Z ≥ 0) = 1/2 and X ⊥ Y′.

5.3.2 Proof of Theorem 3.8

Proof. Unspecified notations in this section are consistent with Algorithm 2. Using Lemma 3.6, we have
Pr{ sup_{i,j∈[N]} | d(y_i, y_j) − d(x_i, x_j) | ≥ Cδ } ≤ 0.01. (5.10)
Now consider the first-block binary codes generated from the Gaussian Toeplitz projection. We focus on two intermediate points y₁ and y₂ and the first block of binary codes generated in the second part of Algorithm 2. We let u = sign(Ψ^(1) y₁), v = sign(Ψ^(1) y₂). Suppose Ψ^(1) contains the Gaussian Toeplitz matrix T. For any i ∈ [m/B], we have
u_i = sign(⟨T_i ∘ ζ, y₁⟩) = sign(⟨T_i, y₁ ∘ ζ⟩), v_i = sign(⟨T_i ∘ ζ, y₂⟩) = sign(⟨T_i, y₂ ∘ ζ⟩).
Since T_i is a Gaussian random vector, we have
Pr(u_i ≠ v_i) = d(y₁ ∘ ζ, y₂ ∘ ζ) = d(y₁, y₂).
Let Z_i = 1(u_i ≠ v_i), i ∈ [m/B]. Following Lemma 3.7, we know that ∀i ≠ j, u_i ⊥ u_j, u_i ⊥ v_j, v_i ⊥ v_j, v_i ⊥ u_j. Therefore {Z_i}_{i=1}^{m/B} is a pair-wise independent sequence. By Chebyshev's inequality, we have
Pr( | (B/m) Σ_{i=1}^{m/B} Z_i − E(Z₁) | ≥ δ ) ≤ (B/m) · Var(Z₁)/δ² ≤ B/(4mδ²) ≤ 1/4. (5.11)

The last inequality holds by setting m/B ≥ 1/δ². Therefore, we have
Pr( | d_H(u, v) − d(y₁, y₂) | ≥ δ ) ≤ 1/4.
Now consider the total B blocks of binary codes {u^i}_{i=1}^B and {v^i}_{i=1}^B generated from y₁ and y₂ respectively. Let
E_i = 1( | d_H(u^i, v^i) − d(y₁, y₂) | ≥ δ ), i ∈ [B].
From (5.11), we have Pr(E_i = 1) ≤ 1/4. If more than half of the E_i are 0, then the median of {d_H(u^i, v^i)}_{i=1}^B is within δ of d(y₁, y₂). Then we have
Pr( | median({d_H(u^i, v^i)}_{i=1}^B) − d(y₁, y₂) | ≥ δ ) ≤ Pr( (1/B) Σ_{i=1}^B E_i ≥ 1/2 ) ≤ Pr( (1/B) Σ_{i=1}^B (E_i − E(E_i)) ≥ 1/4 ) ≤ exp(−B/8).
In the second inequality, we use Pr(E_i = 1) ≤ 1/4. The last step follows from Hoeffding's inequality. Now we use a union bound over the N² pairs:
Pr( sup_{i,j∈[N]} | d_H(b_i, b_j; B) − d(y_i, y_j) | ≥ δ ) ≤ N² exp(−B/8) ≤ exp(−B/16).
The last inequality holds by setting B ≥ 32 log N. Combining the above result and (5.10) using the triangle inequality, we complete the proof.

5.4 Proof of Theorem 3.10

For any set K ⊆ S^{p-1}, we use N_δ(K) to denote a constructed δ-net of K, which is a δ-covering set with minimum size. In particular, by Sudakov's theorem (e.g., Theorem 3.18 in Ledoux and Talagrand (1991)),
log |N_δ(K)| ≲ w(K)²/δ².
We first prove that, for a fixed two-dimensional subspace, m = O(δ⁻² log(1/δ)) independent Gaussian measurements are sufficient to achieve δ-uniform binary embedding.

Lemma 5.6. Suppose K is the intersection of S^{p-1} with any fixed two-dimensional subspace of R^p. Let A ∈ R^{m×p} be a matrix with independent rows A_i ∼ N(0, I_p), i ∈ [m]. Suppose m ≥ (1/δ²) log(1/δ); then with probability at least 1 − 3 exp(−δ² m),
sup_{x,y∈K} | d_A(x, y) − d(x, y) | ≤ Cδ. (5.12)
Here C is some absolute constant.
Proof. We postpone the proof to Appendix C.

The next lemma shows that the normalized ℓ₁ norm of Ax provides a decent approximation of ‖x‖₂.

Lemma 5.7. Consider any set K ⊆ R^p. Let A be an m-by-p matrix with independent rows A_i ∼ N(0, I_p) for any i ∈ [m]. Consider
Z = sup_{x∈K} | (1/m) Σ_{i=1}^m |⟨A_i, x⟩| − √(2/π) ‖x‖₂ |.
We have
Pr{ Z ≥ 4 w(K)/√m + t } ≤ 2 exp( −m t² / (2 d(K)²) ), ∀t > 0,
where d(K) = max_{x∈K} ‖x‖₂.
Proof. See the proof of Lemma 2.1 in Plan and Vershynin (2014).

In order to connect the ℓ₁ norm to the Hamming distance, we need the following result.

Lemma 5.8. Consider a finite number of points K ⊆ S^{p-1}. Let A be an m-by-p matrix with independent rows A_i ∼ N(0, I_p) for any i ∈ [m]. Suppose m ≥ (1/δ²) log |K|; then we have
sup_{x∈K} (1/m) Σ_{i=1}^m 1{ |⟨A_i, x⟩| ≤ δ } ≤ 2δ
with probability at least 1 − exp(−δ² m).
Proof. Let X ∼ N(0, 1). For any fixed point x ∈ K and any i ∈ [m], we have
Pr( |⟨A_i, x⟩| ≤ δ ) = Pr( |X| ≤ δ ) ≤ δ.
Let Z_i = 1{ |⟨A_i, x⟩| ≤ δ }, i ∈ [m]. Then by using Hoeffding's inequality,
Pr( (1/m) Σ_{i=1}^m Z_i − E(Z₁) > δ ) ≤ exp(−2δ² m).
As E(Z₁) = Pr( |⟨A_1, x⟩| ≤ δ ) ≤ δ, we conclude that with probability at least 1 − exp(−2δ² m),
(1/m) Σ_{i=1}^m Z_i ≤ 2δ.
By applying a union bound over the |K| points and setting m ≥ δ⁻² log |K|, we complete the proof.

Now we are ready to prove Theorem 3.10.
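The single-point estimate behind Lemma 5.8, Pr(|⟨A_i, x⟩| ≤ δ) ≤ δ together with Hoeffding concentration of the empirical fraction, can be checked directly for a fixed unit vector:

```python
import numpy as np

def small_margin_fraction(x, m, delta, rng):
    """Empirical version of Lemma 5.8's single-point estimate: for a fixed
    unit vector x and i.i.d. Gaussian rows A_i, the fraction of measurements
    with |<A_i, x>| <= delta concentrates near Pr(|N(0,1)| <= delta) <= delta."""
    A = rng.standard_normal((m, len(x)))
    return float(np.mean(np.abs(A @ x) <= delta))
```

With delta = 0.2, the true probability is about 0.159, comfortably below the crude bound delta used in the proof.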

Proof of Theorem 3.10. We construct a δ-net of K, denoted N_δ. We assume m ≥ (1/δ²) log |N_δ|. Applying Proposition 2.2 with K set to N_δ, we have that
sup_{x,y∈N_δ} | d_A(x, y) − d(x, y) | ≤ δ (5.13)
holds with probability at least 1 − 2 exp(−δ² m). For any two fixed points x, y ∈ K, let x′, y′ be their nearest points in N_δ. Then we have
|d(x, y) − d_A(x, y)| ≤ |d(x, y) − d(x′, y′)| + |d(x′, y′) − d_A(x, y)|
(a) ≤ |d(x′, y′) − d_A(x, y)| + 2δ
≤ |d_A(x′, y′) − d_A(x, y)| + |d(x′, y′) − d_A(x′, y′)| + 2δ
(b) ≤ |d_A(x′, y′) − d_A(x, y)| + 3δ
≤ |d_A(x′, y′) − d_A(x′, y)| + |d_A(x′, y) − d_A(x, y)| + 3δ
(c) ≤ d_A(y′, y) + d_A(x′, x) + 3δ, (5.14)
where (a) follows from
|d(x, y) − d(x′, y′)| ≤ |d(x, y) − d(x′, y)| + |d(x′, y) − d(x′, y′)| ≤ d(x, x′) + d(y, y′) ≤ 2δ,
step (b) follows from (5.13), and step (c) follows from the triangle inequality of the Hamming distance. Therefore we have
sup_{x,y∈K} | d_A(x, y) − d(x, y) | ≤ 2 sup_{x′∈N_δ} sup_{x∈K, ‖x−x′‖₂≤δ} d_A(x, x′) + 3δ. (5.15)
Next we bound the tail term
T := sup_{x′∈N_δ} sup_{x∈K, ‖x−x′‖₂≤δ} d_A(x, x′).
Recall that
K_δ^+ := K ∪ { z ∈ S^{p-1} : z = (x − y)/‖x − y‖₂, x, y ∈ K, δ² ≤ ‖x − y‖₂ ≤ δ }.
Now we construct a δ-net for K_δ^+ \ K, denoted N′_δ. For two distinct points x, y ∈ N_δ ∪ N′_δ, let C(x, y) denote the unit circle spanned by x, y. We construct a δ²-net C_{δ²}(x, y) for each circle C(x, y). For simplicity, we just let C_{δ²}(x, y) be the set of points that uniformly split C(x, y) with interval δ². We thus have |C_{δ²}(x, y)| ≲ 1/δ². Let G_δ denote the union of all the circle nets C_{δ²}(x, y) spanned by points in N_δ ∪ N′_δ, namely
G_δ := ⋃_{x,y ∈ N_δ ∪ N′_δ} C_{δ²}(x, y) ∪ {x, y}.
For any point x ∈ K, we can always find a point in G_δ that is O(δ²) away from x. To see why this is true, we first let x′ be the nearest point to x in N_δ. If ‖x − x′‖₂ ≤ δ²,

then x′ is the point we want. Otherwise, we have δ² ≤ ‖x − x′‖₂ ≤ δ. In this case, we have (x − x′)/‖x − x′‖₂ ∈ K_δ^+. Following the definition of K_δ^+, we can always find a point x″ ∈ N_δ ∪ N′_δ such that
‖ (x − x′)/‖x − x′‖₂ − x″ ‖₂ ≤ δ, (5.16)
thereby
‖ x − ( ‖x − x′‖₂ x″ + x′ ) ‖₂ ≤ ‖x − x′‖₂ · δ ≤ δ².
Let z := ‖x − x′‖₂ x″ + x′. Note that ‖z‖₂ is very close to 1 because
δ⁴ ≥ ‖x − z‖₂² = ‖z‖₂² − 2⟨z, x⟩ + 1 ≥ ‖z‖₂² − 2‖z‖₂ + 1 = ( ‖z‖₂ − 1 )².
We thus have
‖ x − z/‖z‖₂ ‖₂ ≤ ‖x − z‖₂ + ‖ z − z/‖z‖₂ ‖₂ = ‖x − z‖₂ + | ‖z‖₂ − 1 | ≤ 2δ².
Note that z/‖z‖₂ is in the unit circle C(x′, x″) spanned by x′ and x″, thereby there exists u ∈ C_{δ²}(x′, x″) such that ‖ u − z/‖z‖₂ ‖₂ ≤ δ². Point u thus satisfies
‖x − u‖₂ ≤ ‖ x − z/‖z‖₂ ‖₂ + ‖ z/‖z‖₂ − u ‖₂ ≤ 3δ². (5.17)
So for any x ∈ K and its nearest point x′ ∈ N_δ, we define u as
u := x′, if ‖x − x′‖₂ ≤ δ²; u := argmin_{v∈C_{δ²}(x′,x″)} ‖x − v‖₂, otherwise,
where x″ ∈ N_δ ∪ N′_δ satisfies (5.16). Based on (5.17), we always have ‖u − x‖₂ ≤ 3δ² and ‖u − x′‖₂ ≤ ‖u − x‖₂ + ‖x − x′‖₂ ≤ 2δ. By the triangle inequality of the Hamming distance,
d_A(x, x′) ≤ d_A(x, u) + d_A(u, x′).
We thus have
T ≤ sup_{x′∈N_δ} sup_{x∈K, ‖x−x′‖₂≤δ} [ d_A(x, u) + d_A(u, x′) ] ≤ sup_{u∈G_δ} sup_{x∈K, ‖x−u‖₂≤3δ²} d_A(x, u) [=: T₁] + sup_{x′,x″∈N_δ∪N′_δ} sup_{u,v∈C(x′,x″), ‖u−v‖₂≤2δ} d_A(u, v) [=: T₂].
Next we bound the terms T₁ and T₂ respectively.
Term T₁. For a fixed point u ∈ G_δ, using Lemma 5.7 with (K, t) in the statement set to K′ = (K − {u}) ∩ { z ∈ R^p : ‖z‖₂ ≤ 3δ² } and δ² respectively yields
Pr{ sup_{x∈K, ‖x−u‖₂≤3δ²} | (1/m) Σ_{i=1}^m |⟨A_i, x − u⟩| − √(2/π) ‖x − u‖₂ | ≥ 4 w(K′)/√m + δ² } ≤ 2 exp( −m δ⁴ / (2 d(K′)²) ) ≤ 2 exp(−m/18).

Then with probability greater than 1 − 2 exp(−m/18),
sup_{x∈K, ‖x−u‖₂≤3δ²} (1/m) Σ_{i=1}^m |⟨A_i, x − u⟩| ≤ 3√(2/π) δ² + 4 w(K′)/√m + δ² ≤ 5δ²,
where the last inequality follows from the fact that w(K′) ≤ w(K) and our assumption m ≥ c₀ w(K)²/δ⁴. We define the event
E := { sup_{u∈G_δ} sup_{x∈K, ‖x−u‖₂≤3δ²} (1/m) Σ_{i=1}^m |⟨A_i, x − u⟩| ≤ 5δ² }.
Applying a union bound over all points in G_δ, we have
Pr(Eᶜ) ≤ 2 |G_δ| exp(−m/18) ≤ 2 exp(−m/36),
where the last inequality holds when m ≳ log |G_δ|. Under the event E, we have
sup_{u∈G_δ} sup_{x∈K, ‖x−u‖₂≤3δ²} (1/m) Σ_{i=1}^m 1{ |⟨A_i, u − x⟩| ≥ 5δ } ≤ δ. (5.18)
If sign(⟨A_i, u⟩) ≠ sign(⟨A_i, x⟩), we must have |⟨A_i, u⟩| ≤ |⟨A_i, u − x⟩|. We then have
T₁ ≤ sup_{u∈G_δ} sup_{x∈K, ‖x−u‖₂≤3δ²} (1/m) Σ_{i=1}^m 1{ |⟨A_i, u⟩| ≤ |⟨A_i, u − x⟩| } ≤ sup_{u∈G_δ} (1/m) Σ_{i=1}^m 1{ |⟨A_i, u⟩| ≤ 5δ } + δ,
where the last inequality follows from (5.18). Using Lemma 5.8 with K and δ in the statement set to G_δ and 5δ respectively, we have that, when m ≥ c (1/δ²) log |G_δ| with some absolute constant c, the inequality
sup_{u∈G_δ} (1/m) Σ_{i=1}^m 1{ |⟨A_i, u⟩| ≤ 5δ } ≤ 10δ
holds with probability at least 1 − exp(−25δ² m). Putting all the ingredients together, we have T₁ ≲ δ with high probability.
Term T₂. There are at most |N_δ ∪ N′_δ|² different two-dimensional subspaces constructed from N_δ ∪ N′_δ. Applying Lemma 5.6 and a probabilistic union bound over all subspaces yields
Pr( T₂ ≥ (C + 2)δ ) ≤ 3 |N_δ ∪ N′_δ|² exp(−δ² m) ≤ 3 exp(−δ² m/2),
where the last inequality holds by setting m ≥ (4/δ²) log |N_δ ∪ N′_δ|. Putting (5.15) and the upper bounds of the terms T₁, T₂ together, we conclude that by choosing
m ≥ c max{ w(K)²/δ⁴, (1/δ²) log |G_δ|, (1/δ²) log |N_δ ∪ N′_δ| },

we have
sup_{x,y∈K} | d_A(x, y) − d(x, y) | ≲ δ
with probability at least 1 − c₁ exp(−c₂ δ² m), where c₁, c₂ are absolute constants. Using the facts that |G_δ| ≲ (1/δ²) |N_δ ∪ N′_δ|² and
log |N_δ ∪ N′_δ| ≲ (1/δ²) w(N_δ ∪ N′_δ)² ≲ (1/δ²) w(K_δ^+)²,
we complete the proof.

A Proof of Lemma 3.6

Proof. Recall that y_i = √(p/m) Φ(ζ) x_i. We let ŷ_i = y_i/‖y_i‖₂, ŷ_j = y_j/‖y_j‖₂. From condition (3.7), we have
‖y_i − ŷ_i‖₂ ≤ δ, ‖y_j − ŷ_j‖₂ ≤ δ. (A.1)
Let θ = ∠(x_i, x_j) and θ′ = ∠(ŷ_i, ŷ_j). Without loss of generality, we assume our set K = {x_i}_{i=1}^N is symmetric, i.e., if x ∈ K then −x ∈ K. Suppose we show that inequality (3.8) holds for any two points x_i, x_j with ⟨x_i, x_j⟩ ≥ 0; then for x_i, x_j with ⟨x_i, x_j⟩ < 0, we immediately have
| d(y_i, y_j) − d(x_i, x_j) | = | (1 − d(y_i, y_j)) − (1 − d(x_i, x_j)) | = | d(−y_i, y_j) − d(−x_i, x_j) | ≤ Cδ.
In the second equality, we use d(−x, y) + d(x, y) = 1, ∀x, y ∈ S^{p-1}. In the last inequality, we use the fact that the fast JL transform √(p/m) Φ(ζ) is linear, thus −y_i = √(p/m) Φ(ζ)(−x_i). Therefore, without loss of generality, we assume ⟨x_i, x_j⟩ ≥ 0, thus θ ≤ π/2. Now we turn to the quantity
‖ŷ_i − ŷ_j‖₂ = ‖ŷ_i − y_i + y_i − y_j + y_j − ŷ_j‖₂ ≤ ‖ŷ_i − y_i‖₂ + ‖ŷ_j − y_j‖₂ + ‖y_i − y_j‖₂ ≤ 2δ + ‖x_i − x_j‖₂ (1 + δ).
The last inequality follows from (A.1) and condition (3.6). Similarly, we also have
‖ŷ_i − ŷ_j‖₂ ≥ ‖x_i − x_j‖₂ (1 − δ) − 2δ.
Using the facts that
sin(θ′/2) = ‖ŷ_i − ŷ_j‖₂ / 2, sin(θ/2) = ‖x_i − x_j‖₂ / 2,
we have
| sin(θ′/2) − sin(θ/2) | = | ‖ŷ_i − ŷ_j‖₂ − ‖x_i − x_j‖₂ | / 2 ≤ ( δ ‖x_i − x_j‖₂ + 2δ ) / 2 ≤ 2δ.

When δ < (√3 − √2)/4, we have
sin(θ′/2) ≤ sin(θ/2) + 2δ ≤ √2/2 + 2δ < √3/2.
In the last inequality, we use sin(θ/2) ≤ √2/2 for θ ∈ [0, π/2]. So θ′/2 ∈ [0, π/3]. Using the fact that, for any two angles θ₁, θ₂ ∈ [0, π/3], there exists a constant c such that |sin θ₁ − sin θ₂| ≥ c |θ₁ − θ₂|, we have
| θ′/2 − θ/2 | ≤ (1/c) | sin(θ′/2) − sin(θ/2) | ≤ 2δ/c.
Therefore,
| d(y_i, y_j) − d(x_i, x_j) | = |θ′ − θ|/π ≤ Cδ.
In the case δ ≥ (√3 − √2)/4, we trivially have | d(y_i, y_j) − d(x_i, x_j) | ≤ 1 ≤ Cδ with constant C = 4/(√3 − √2).

B Proof of Lemma 5.4

Proof. For a positive semidefinite matrix M ∈ R^{p×p} with rank d, let λ₁, λ₂, ..., λ_d be its positive eigenvalues. Using the definition of the Frobenius norm, we have
‖M‖_F² = Σ_{i=1}^d λ_i² = Σ_{i,j∈[p]} (M_{i,j})² ≤ p + (p² − p)δ₂².
On the other hand, considering the trace of M, we obtain
Σ_{i=1}^d λ_i = Trace(M) ≥ p(1 − δ₁). (B.1)
Using the Cauchy-Schwarz inequality, we have
( Σ_{i=1}^d λ_i )² ≤ d Σ_{i=1}^d λ_i². (B.2)
Plugging (B.1) into (B.2) and using the bound on ‖M‖_F² yields
d ≥ p(1 − δ₁)² / (1 + (p − 1)δ₂²).

C Proof of Lemma 5.6

Proof. Without loss of generality, we assume K = {x ∈ S^{p-1} : supp(x) ⊆ {1, 2}}. We begin by constructing a δ-net for K, denoted N_δ. For simplicity, we can just let N_δ(K) be the set of points that split the circle spanned by {e₁, e₂} uniformly. Therefore |N_δ(K)| = O(1/δ). Applying Proposition 2.2 gives us that
sup_{x,y∈N_δ} | d_A(x, y) − d(x, y) | ≤ δ (C.1)
holds with probability at least 1 − 2 exp(−δ² m) when m ≳ (1/δ²) log(1/δ). For any point x ∈ K, ⟨A_i, x⟩ only depends on the first two coordinates of A_i. Therefore, for simplicity, we let
Ā_i = ( A_i ∘ (e₁ + e₂) ) / ‖ A_i ∘ (e₁ + e₂) ‖₂, i ∈ [m],
where ∘ denotes the entrywise product; each Ā_i is uniformly distributed on the unit circle spanned by {e₁, e₂}. For any point x ∈ N_δ, using the uniform distribution of Ā_i, we have
Pr( |⟨Ā_i, x⟩| ≤ δ ) ≤ Cδ
with some absolute constant C. Using Hoeffding's inequality and a probabilistic union bound over all points in N_δ, we have
Pr( sup_{x∈N_δ} (1/m) Σ_{i=1}^m 1{ |⟨Ā_i, x⟩| ≤ δ } > (C + 1)δ ) ≤ |N_δ| exp(−2δ² m) ≤ exp(−δ² m). (C.2)
The last inequality holds when m ≥ (1/δ²) log(1/δ). Now we consider any point x ∈ K. Suppose x′ is the closest point to x in N_δ. We note that if sign(⟨Ā_i, x⟩) ≠ sign(⟨Ā_i, x′⟩), then there exists λ ∈ [0, 1] such that ⟨Ā_i, x′ + λ(x − x′)⟩ = 0. We thus have
| ⟨Ā_i, x′⟩ | = λ | ⟨Ā_i, x − x′⟩ | ≤ λ ‖x − x′‖₂ ≤ δ.
Further we obtain
sup_{x′∈N_δ} sup_{x∈K, ‖x−x′‖₂≤δ} d_A(x, x′) = sup_{x′∈N_δ} sup_{x∈K, ‖x−x′‖₂≤δ} (1/m) Σ_{i=1}^m 1( sign(⟨Ā_i, x⟩) ≠ sign(⟨Ā_i, x′⟩) ) ≤ sup_{x′∈N_δ} (1/m) Σ_{i=1}^m 1{ |⟨Ā_i, x′⟩| ≤ δ }.
Combining the above result with (C.2), we obtain that, with probability at least 1 − exp(−δ² m),
sup_{x′∈N_δ} sup_{x∈K, ‖x−x′‖₂≤δ} d_A(x, x′) ≤ (C + 1)δ. (C.3)

For any points x, y ∈ K, let x′, y′ be their nearest points in N_δ. We have
|d(x, y) − d_A(x, y)| ≤ |d(x, y) − d(x′, y′)| + |d(x′, y′) − d_A(x, y)|
(a) ≤ |d(x′, y′) − d_A(x, y)| + 2δ
≤ |d(x′, y′) − d_A(x′, y′)| + |d_A(x′, y′) − d_A(x, y)| + 2δ
(b) ≤ |d_A(x′, y′) − d_A(x, y)| + 3δ
≤ |d_A(x′, y′) − d_A(x′, y)| + |d_A(x′, y) − d_A(x, y)| + 3δ
(c) ≤ d_A(y′, y) + d_A(x′, x) + 3δ
(d) ≤ (2C + 5)δ,
where (a) follows from
|d(x, y) − d(x′, y′)| ≤ |d(x, y) − d(x′, y)| + |d(x′, y) − d(x′, y′)| ≤ d(x, x′) + d(y, y′) ≤ 2δ,
step (b) follows from (C.1), step (c) follows from the triangle inequality of the Hamming distance, and step (d) is from (C.3).

References

Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pages 557-563. ACM, 2006.

Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Transactions on Algorithms (TALG), 9(3):21, 2013.

Nir Ailon and Holger Rauhut. Fast and RIP-optimal transforms. Discrete & Computational Geometry, 52(4):780-798, 2014.

Noga Alon. Problems and results in extremal combinatorics, I. Discrete Mathematics, 273(1-3):31-53, 2003.

Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science (FOCS), 47th Annual IEEE Symposium on, pages 459-468. IEEE, 2006.

Mahdi Cheraghchi, Venkatesan Guruswami, and Ameya Velingker. Restricted isometry of Fourier matrices and list decodability of random linear codes. SIAM Journal on Computing, 42(5):1888-1914, 2013.

Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A Procrustean approach to learning binary codes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 817-824, 2011.

Yunchao Gong, Sanjiv Kumar, Henry A. Rowley, and Svetlana Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 484-491. IEEE, 2013.