Locality Sensitive Binary Codes for Shift Invaliant KernelsとSpectral Hashingの比較

Locality Sensitive Hashing(LSH)とは、ベクトルとして表現されたデーターの集合を入力として、それらの２点間の距離を保存したまま、ハミング距離に基づく文字列の集合に射影する技術です。コサイン距離[1]、ユーグリッド距離[2]に基づくものや、機械学習法を応用した、semantic hashing[3], spectral hashing[4], kernelized LSH[5], その他[6][7][8]、現在までに多くの手法が提案されています。この背景には、Googleが、昔に提案されたLSHが、ニュース記事の推薦システムで使えることを示した[9]のきっかけに、現在、推薦システム、画像検索、文章のクラスタリング[10]など、色々なシステムや研究の場面で利用されています。

理論的な収束の保証があるという意味で、オリジナルのコサイン距離ベース[1]の手法が良いのですが、昨年の年末に開催された、NIPS2010にて, kernel類似度ベースの手法[11]が提案されて、収束性の理論的保証もあり、性能も良いということで実装して、PCAに基づくspectral hashingと比較してみました。比較法としては、1000点の960次元のベクトル表現された画像データーを、それぞれの手法でバイナリの文字列へと射影して、総当たりでユーグリッド距離とハミング距離を計算して、横軸がユーグリッド距離、縦軸がハミング距離の図を作成しました、下がその結果ですが、Locality Sensitive Binary Codes(LSBC)がNIPSで提案された手法です。続いて、spectral hashing(sp)となってます。LSBCの方が、ビットを、128bit, 512bit, 960bitと射影する文字列を長くする程、ハミング距離がユーグリッド距離をよく近似しているということが、定性的にですが、わかりました。だいたい512bitもあると収束するようです。一方、spectral hasingは、bitを長くしても精度が上がりません。この理由は、spectral hashingは、PCAで、固有値が大きい次元から順に選択しているために、ある程度の次元を選択したところで収束して、それ以上次元を上げても、精度が上がりません。よって、bitを長くする毎に、精度のよくなるLSBCの方が良いようです。実装は、後でまとめて公開します。

追記：
さらに、Spectral Hashingを改良した手法が、今年のICMLで提案されています。
もう、入手可能なようです。
J.Wang, S.Kumar, S.Chang. Sequenctial Projection Learning for Hashing with Compact Codes, ICML2010
http://www.ee.columbia.edu/ln/dvmm/publications/10/SPLH_ICML2010.pdf

読んでみたのですが、短いビットでの精度がさらに上がっているようです。

・参考文献
[1]M.Goemans and D.Willianmson. Imrpoved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115-1145.
[2]M.Datar,N.Immorlica, P.Indyk, and V.Mirrokni.Locality sensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry, 2004.
[3]R.Salakhutdinov and G.Hinton. Semantic hashing. In SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.
[4]Y.Weiss, A.Torralba, and R.Fergus. Spectral hashing. In Advances in Neural Information Processing Systems, 2009
[5]B.Kulis, P.Jain, and K.Grauman. Fast Similarity Search for learned Metrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2143,2009
[6]G.Shakhnarovich, P.Viola, and T.Darrell. Fast pose estimation with parameter-sensitive hashing. In Proceedings of the Ninth IEEE International Conference on Computer Vision, page 750, 2003
[7]B.Kulis, P.Jain, and K.Grauman. Fast Similarity Search for Learned Metrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2143,2009
[8]B.Kulis and T.Darrell. Learning to Hash with Binary Reconstructive Embeddings. In Advances in Neural Information Processing Systems, 2010
[9]A.Das,M.Datar,A.Garg,and S.Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, page 280. ACM, 2007
[10]D.Ravichandra, P.Pantel, and E.Hovy. Randomized Algorithms and NLP: Using Locality Sensitive Hash Function for High Speed Noun Clustering. In Annual Meeting of the Association for Computational Linguistics, pages 345-356, 2005.
[11]M.Raginsky and S.Lazebnik. Locality-Sensitive Binary Codes from Shift-Invariant Kernels. In Advances in Neural Information Processing Systems,2010