Tech Tips: 7月 2015

2015年7月21日火曜日

scikit-learn(5) Gradient Boosting

　以前から気になっていたGradient Boostingについて勉強した。

Kaggleのトップランカーたちを見ていると、SVM、Random Forest、Neural Network、Gradient Boostingの4つをstackingして使っていることが多い。SVMとRFは使えるので、そろそろGradient Boostingも使えるようになっておきたところ。

Gradient Boostingの基本

いいビデオがあった。これを見ると基本的なアイディアはつかめる。

Copyright Information

Prof. Alexander Ihler, Ensembles (3): Gradient Boosting

License: Standard YouTube License

Boostingとは？

・アンサンブル学習の一種

・弱学習器を直列に繋ぐ

・各段の学習器は前段でエラーとなったサンプルに重点を置く

Gradient Boostingとは？

・Boostingモデルの一つ

・以下の手続きを繰り返す

　1. 弱学習器を学習

　2. 残差（観測値と予測値の差）を測定

　3. 残差にフィットするように次の弱学習器を学習

なぜGradient Boostingと呼ばれるか？

残差を使って学習器の予測値を逐次的に更新する

⇒Gradient DescentでMSEを最小化することと等価

※MSEの勾配 ~ 残差

Gradient Boostingの実用的な話

scikit-learnのdeveloperの人のビデオに他のモデルとの比較やパラメータチューニングの話があった。

Copyright Information

Peter Prettenhofer, Gradient Boosted Regression Trees in scikit-learn

License: Standard YouTube License

なぜGradient Boosting？
・Gradient Boostingは有名なデータサイエンス系のコンペティションで頻繁に用いられている

Gradient Boostingの利点
・不均一なデータを扱える（特徴量のスケールが違ってもOK）
・いろいろな損失関数をサポート
・特徴量の非線形な相互作用を自動検出

scikit-learnで使用する際のパラメータチューニング
・n_estimators（boostingのステージ数）=3000で他のハイパーパラメータをgride searchでチューニング
・n_estimatorsを大きくし、learning_rate（学習率；弱学習器を結合するときの係数）をチューニング

サンプルコード

scikit-learnでボストンの家の価格を推定するサンプルコードを書いてみた。
パラメータチューニングはPeter氏が推奨している方法でおこなった。

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_boston

data = load_boston()

x = data['data']
y = data['target']

# tune hyper parameters first
model = GradientBoostingRegressor(n_estimators=3000)
parameters = {'learning_rate' : [0.1, 0.05, 0.02, 0.01], 
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17],
              'max_features': [1.0, 0.3, 0.1]}

gscv = GridSearchCV(model, parameters, verbose=10, n_jobs=-1, cv=4)
gscv.fit(x, y)
print "best score=", gscv.best_score_

# tune learning rate with higher n_estimators
model = gscv.best_estimator_
model.set_params(n_estimators=100000)
parameters = {'learning_rate' : [0.1, 0.05, 0.02, 0.01]}
gscv = GridSearchCV(model, parameters, verbose=10, n_jobs=-1, cv=4)
gscv.fit(x, y)
print "best score=", gscv.best_score_

2015年7月19日日曜日

Normalized compression distanceとNormalized Google distance

文章間の類似度を測るおもしろい指標を発見したので、メモ。

Normalized compression distance

１つ目はNormalized compression distance（NCD）。日本語でいうと正規化圧縮距離。

アイデアを簡単にいうと以下のとおり。

文章xと文章yがある。

Z(d)をドキュメントdを圧縮したときのサイズ（圧縮アルゴリズムはgzip, gzip2など）とする。

もしxとyの関連性が高ければ、Z(x + y)と、Z(x)およびZ(y)間の差は小さくなるはずである。

以下Pythonで書いたサンプルコード。
共通する単語を多く含むものが関連性が高いと判断される。

# -*- coding: utf-8 -*-

import zlib

"""Normalized compression distance

See the page below for more details:
https://en.wikipedia.org/wiki/Normalized_compression_distance#Normalized_compression_distance

"""
def ncd(x,y):
    if x == y:
        return 0
    z_x = len(zlib.compress(x))
    z_y = len(zlib.compress(y))
    z_xy = len(zlib.compress(x + y))    
    return float(z_xy - min(z_x, z_y)) / max(z_x, z_y)

if __name__ == '__main__':
    query = 'Hello, world!'
    results = ['Hello, Python world!', 
               'Goodbye, Python world!', 
               'world record']

    for r in results:
        print r, ncd(query, r)

Normalized Google distance

2つ目はNormalized Google distance（NGD）。日本語でいうと正規化Google距離。

Googleじゃなくてもクエリに対する結果ページの総数を返してくれる検索エンジンであれば何でもよい。
似たような方法を昔自分で試したことがあるが、既知の手法だったらしい。

アイデアを簡単にいうと以下のとおり。

文章xと文章yがある。

F(q)をクエリqに対する検索結果件数とする。

文章xと文章yの関連性が高ければ、F(x + y)と、F(x)およびF(y)間の差は小さくなるはずである。

NGDは、NCDと違って、セマンティックな類似度を測定することが出来る。

以下Pythonで書いたサンプルコード。

きちんとbig apple = new york、SVM = support vector machineという関連性を距離に反映出来ている！！
以下のサンプルではクエリにマッチした検索結果はAPIで取ってきているが、APIには使用回数制限があるため注意。

# -*- coding: utf-8 -*-

import urllib
import json
from math import log

"""Normalized Google distance

See the page below for more details:
https://en.wikipedia.org/wiki/Normalized_compression_distance#Normalized_Google_distance

"""

def gcnt(q):
    url = 'https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s' % urllib.quote(q)
    res = urllib.urlopen(url)
    json_res = json.loads(res.read())
    return json_res['responseData']['cursor']['estimatedResultCount']

def ngd(x,y):
    if x == y:
        return 0
    f_x = log(float(gcnt(x)))
    f_y = log(float(gcnt(y)))
    f_xy = log(float(gcnt(x + y)))
    N = 50 * 1e9   # total number of indexed pages
    return (max(f_x, f_y) - f_xy) / (log(N) - min(f_x, f_y))

if __name__ == '__main__':
    # ex1) Which city is associated with "big apple?"
    query = 'big apple'
    results = ['Tokyo', 
               'London', 
               'Paris', 
               'New York']
    for r in results:
        print r, ngd(query, r)

    # ex2) Which is related with "SVM?"
    query = 'SVM'
    results = ['Random Forest', 
               'Gradient Boosting', 
               'Artificial Neural Network', 
               'Support Vector Machine']
    for r in results:
        print r, ngd(query, r)

2015年7月9日木曜日

文字間の距離をいろいろ試せるライブラリ

　最近参加したKaggleのコンテストでTF-IDFのcosine similarityやBM25というテキスト間の距離概念を知った。他にもテキスト間の距離が存在していて、PythonのDistanceというライブラリを使うと簡単に計算できる。

このライブラリの便利なところは、”文字”ではなく”単語”を構成要素として距離を測れるところ。

例えば、
hoge fuga foo bar
hige foge hoo car

のハミング距離を”文字”単位で測ると4（文字数で正規化して0.235）で、結構近いなーとなるが、
単語単位で見ると4（単語数で正規化して1.000）で遠いことになる。
テキスト処理の場合はBOWのように単語単位で処理する場合が多いと思うので、これは嬉しい。

　Distanceで計算できる距離には、

などがある。

　ちょっと遊んでみたのでコードを載せておく。

wikipediaからSVM, RF, ANNの説明文を持ってきて、"support vector machines"との距離を測ってみた。本来なら、lowercase変換/stopwords除去/stemmingなどの前処理を行うべきだが、今回はお試し実装なので省略した。

import re
import distance

if __name__ == '__main__':

    query = 'support vector machines'
    text = [
        'In machine learning, support vector machines (SVMs, also support vector networks[1]) \
        are supervised learning models with associated learning algorithms that analyze data \
        and recognize patterns, used for classification and regression analysis.',
        
        'Random forests are an ensemble learning method for classification, regression and \
        other tasks, that operate by constructing a multitude of decision trees at training \
        time and outputting the class that is the mode of the classes (classification) or \
        mean prediction (regression) of the individual trees.',
        
        'In machine learning and cognitive science, artificial neural networks (ANNs) are \
        a family of statistical learning models inspired by biological neural networks \
        (the central nervous systems of animals, in particular the brain) and are used to \
        estimate or approximate functions that can depend on a large number of inputs and \
        are generally unknown.'
    ]

    q = re.split('\W+', query)
    for i, t in enumerate(text):
        w = re.split('\W+', t)
        w = filter(len, w)
        
        print "text_%d: %f %f %f %f" % (i+1, distance.nlevenshtein(q, w, method=1), \
                distance.nlevenshtein(q, w, method=2), distance.sorensen(q, w), distance.jaccard(q, w))

以下結果。text_1がSVMの説明である可能性が高そうと予測できる。

text_1: 0.906250 0.906250 0.800000 0.888889
text_2: 1.000000 1.000000 1.000000 1.000000
text_3: 1.000000 1.000000 1.000000 1.000000

2015年7月8日水曜日

Pythonでレーベンシュタイン距離を計算

levenというライブラリを使うとレーベンシュタイン距離（編集距離）を簡単に計算出来る。
文章データのfeature engineeringをするときに使えそう。

from leven import levenshtein

print levenshtein("abcd", "acd")  # dist=1
print levenshtein("ab", "abcd")   # dist=2
print levenshtein("hoge", "hige") # dist=1

ちなみにレーベンシュタイン距離は動的計画法の有名問題の一つ。自分で解いたこと無い人は解いてみると面白いかもしれない。

登録: 投稿 (Atom)

Tech Tips

Page List

Search on the blog