[Recommender System (RS)] Factorization Machine
A Factorization Machine (FM) combines the strengths of SVMs and factorization models.
It is a general predictor that works with any real-valued feature vector.
The FM model equation can be evaluated in linear time.
Most specialized recommenders require special input data and their own optimization algorithms.
A Factorization Machine, by contrast, is easy to apply almost anywhere.
It also addresses the weakness of SVMs, which perform poorly on sparse data: an FM still works well in sparse settings.
Key idea: one-hot-encode the sparse interaction data and train on that representation.
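For reference, the second-order FM model that the code below implements is (standard notation from Rendle's FM paper):

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```

The pairwise term can be rewritten as \(\tfrac{1}{2}\sum_{f=1}^{k}\big[(\sum_i v_{i,f}x_i)^2 - \sum_i v_{i,f}^2 x_i^2\big]\), which is what makes the model evaluable in linear time \(O(kn)\); this is exactly the summed/summed_squared trick used in `predict` further down.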
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import scipy
import math
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
1. Loading the data and splitting it into train and test sets
The KMRD dataset is used throughout.
GitHub - lovit/kmrd: Synthetic dataset for recommender system created from Naver Movie rating system
data_path = "파일경로"
%cd $data_path
if not os.path.exists(data_path):
    !git clone https://github.com/lovit/kmrd
    !python setup.py install
else:
    print("data and path already exists!")
Set the data path; if the data is not there yet, the cell above fetches it with git clone.
Data Loader
rates.csv
df = pd.read_csv(os.path.join(data_path, 'rates.csv'))
train_df, val_df = train_test_split(df, test_size=0.2, random_state=1234, shuffle=True)
movie dataframe
# Load all related dataframes
movies_df = pd.read_csv(os.path.join(data_path, 'movies.txt'), sep='\t', encoding='utf-8')
movies_df = movies_df.set_index('movie')
castings_df = pd.read_csv(os.path.join(data_path, 'castings.csv'), encoding='utf-8')
countries_df = pd.read_csv(os.path.join(data_path, 'countries.csv'), encoding='utf-8')
genres_df = pd.read_csv(os.path.join(data_path, 'genres.csv'), encoding='utf-8')
# Get genre information
genres = [(index, '/'.join(x['genre'].values)) for index, x in genres_df.groupby('movie')]  # the group key is the movie id
combined_genres_df = pd.DataFrame(data=genres, columns=['movie', 'genres'])
combined_genres_df = combined_genres_df.set_index('movie')
# Get castings information
castings = [(index, x['people'].values) for index, x in castings_df.groupby('movie')]
combined_castings_df = pd.DataFrame(data=castings, columns=['movie','people'])
combined_castings_df = combined_castings_df.set_index('movie')
# Get countries for movie information
countries = [(index, ','.join(x['country'].values)) for index, x in countries_df.groupby('movie')]
combined_countries_df = pd.DataFrame(data=countries, columns=['movie', 'country'])
combined_countries_df = combined_countries_df.set_index('movie')
movies_df = pd.concat([movies_df, combined_genres_df, combined_castings_df, combined_countries_df], axis=1)
print(movies_df.shape)
print(movies_df.head())
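The list-comprehension-plus-groupby pattern above can also be written with `groupby().agg()`. A minimal sketch on toy data (the movie ids and genres here are made up, not from KMRD):

```python
import pandas as pd

# Toy stand-in for genres_df: one row per (movie, genre) pair
toy = pd.DataFrame({'movie': [1, 1, 2],
                    'genre': ['드라마', '코미디', '액션']})

# One '/'-joined genre string per movie, same result as the list comprehension
joined = toy.groupby('movie')['genre'].agg('/'.join)
print(joined.loc[1])  # 드라마/코미디
```

This avoids the per-group `list(set(...))[0]` dance, since the groupby key already is the movie id.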
- Factorization Machine -> takes a single feature vector as input
- feature vector: user one-hot vector + item one-hot vector + meta information + other feature-engineered vectors
# genre onehot vector
dummy_genres_df = movies_df['genres'].str.get_dummies(sep='/')
dummy_genres_df.head()
movies_df['grade'].unique()
Out
array(['전체 관람가', '12세 관람가', 'PG', '15세 관람가', 'NR', '청소년 관람불가', 'PG-13',
'R', 'G', nan], dtype=object)
dummy_grade_df = pd.get_dummies(movies_df['grade'], prefix='grade')
dummy_grade_df.head()
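A toy illustration of the two one-hot helpers used above, `Series.str.get_dummies` (splits a delimited string into indicator columns) and `pd.get_dummies` (one column per category). The example values are made up:

```python
import pandas as pd

# Multi-valued column: split on '/' into one indicator column per genre
genres = pd.Series(['드라마/코미디', '액션'])
dummies = genres.str.get_dummies(sep='/')
print(dummies.columns.tolist())  # ['드라마', '액션', '코미디'] (sorted)

# Single-valued categorical column with a prefix
grades = pd.Series(['PG', 'R', 'PG'])
grade_dummies = pd.get_dummies(grades, prefix='grade')
print(grade_dummies.columns.tolist())  # ['grade_PG', 'grade_R']
```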
Convert to the Factorization Machine format
- represent each user as a one-hot vector
- represent each item as a one-hot vector
- build categorical features from movies_df
train_df['movie'].apply(lambda x: dummy_genres_df.loc[x])
Genre one-hot vector
test_df = pd.get_dummies(train_df['user'], prefix='user')
test_df.head()
test_df = pd.get_dummies(train_df['movie'], prefix='movie')
X_train = pd.concat([pd.get_dummies(train_df['user'], prefix='user'),
                     pd.get_dummies(train_df['movie'], prefix='movie'),
                     train_df['movie'].apply(lambda x: dummy_genres_df.loc[x]),
                     train_df['movie'].apply(lambda x: dummy_grade_df.loc[x])], axis=1)
# Because the average rating is high, only a rating of 10 becomes 1; everything else becomes -1 (binary labels)
# The negative class is -1 rather than 0 so the loss below can use y * pred
y_train = train_df['rate'].apply(lambda x: 1 if x > 9 else -1)
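A quick illustration of the binarization above (the ratings here are made up, on KMRD's 1-10 scale):

```python
import pandas as pd

rates = pd.Series([10, 8, 10, 1])
labels = rates.apply(lambda x: 1 if x > 9 else -1)
print(labels.tolist())  # [1, -1, 1, -1]
```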
print(X_train.shape)
print(y_train.shape)
# Use a CSR matrix so the positions of the nonzero entries are directly accessible
# csr_matrix explained here: https://rfriend.tistory.com/551
X_train_sparse = scipy.sparse.csr_matrix(X_train.values)
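The SGD loop below walks the CSR buffers directly, so it helps to see what `data`, `indices`, and `indptr` hold. A toy matrix (values made up):

```python
import numpy as np
import scipy.sparse

# 2x4 toy matrix with three nonzeros
dense = np.array([[1., 0., 2., 0.],
                  [0., 0., 0., 3.]])
m = scipy.sparse.csr_matrix(dense)

# Row i's nonzero values live in data[indptr[i]:indptr[i+1]],
# and their column ids in indices over the same slice
print(m.data)     # [1. 2. 3.]
print(m.indices)  # [0 2 3]
print(m.indptr)   # [0 2 3]
```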
Train Factorization Machine
# Compute negative log likelihood between prediction and label
def log_loss(pred, y):
return np.log(np.exp(-pred * y) + 1.0)
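A quick numeric sanity check of this loss (the helper is restated so the snippet runs on its own; the inputs are made up): it should be near zero for a confident correct prediction and large for a confident wrong one.

```python
import numpy as np

def log_loss(pred, y):
    return np.log(np.exp(-pred * y) + 1.0)

confident_right = log_loss(3.0, 1)   # small: ~0.049
confident_wrong = log_loss(3.0, -1)  # large: ~3.049
```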
# Update the parameters with one SGD pass over every sample
def sgd(X, y, n_samples, n_features,
        w0, w, v, n_factors, learning_rate, reg_w, reg_v):
    data = X.data
    indptr = X.indptr
    indices = X.indices
    loss = 0.0
    for i in range(n_samples):
        pred, summed = predict(X, w0, w, v, n_factors, i)
        # calculate the loss and its gradient
        loss += log_loss(pred, y[i])
        loss_gradient = -y[i] / (np.exp(y[i] * pred) + 1.0)
        # update the bias/intercept term
        w0 -= learning_rate * loss_gradient
        # update the weights
        for index in range(indptr[i], indptr[i + 1]):
            feature = indices[index]
            w[feature] -= learning_rate * (loss_gradient * data[index] + 2 * reg_w * w[feature])
        # update the factors
        for factor in range(n_factors):
            for index in range(indptr[i], indptr[i + 1]):
                feature = indices[index]
                term = summed[factor] - v[factor, feature] * data[index]
                v_gradient = loss_gradient * data[index] * term
                v[factor, feature] -= learning_rate * (v_gradient + 2 * reg_v * v[factor, feature])
    loss /= n_samples
    # w0 is a plain float and is passed by value, so return it
    # alongside the loss to keep the bias update across epochs
    return loss, w0
def predict(X, w0, w, v, n_factors, i):
    """Predict a single instance i of the CSR matrix X."""
    data = X.data
    indptr = X.indptr
    indices = X.indices
    summed = np.zeros(n_factors)
    summed_squared = np.zeros(n_factors)
    # linear output w * x
    pred = w0
    for index in range(indptr[i], indptr[i + 1]):
        feature = indices[index]
        pred += w[feature] * data[index]
    # factor output
    for factor in range(n_factors):
        for index in range(indptr[i], indptr[i + 1]):
            feature = indices[index]
            term = v[factor, feature] * data[index]
            summed[factor] += term
            summed_squared[factor] += term * term
        pred += 0.5 * (summed[factor] * summed[factor] - summed_squared[factor])
    # summed does not depend on the gradient update, so it can be re-used there
    return pred, summed
Implementing fit
# Train the Factorization Machine
# X -> sparse csr_matrix, y -> labels
def fit(X, y, config):
    epochs = config['num_epochs']
    num_factors = config['num_factors']
    learning_rate = config['learning_rate']
    reg_weights = config['reg_weights']
    reg_features = config['reg_features']
    num_samples, num_features = X.shape
    weights = np.zeros(num_features)  # -> w
    global_bias = 0.0  # -> w0
    # latent factors for all features -> v
    feature_factors = np.random.normal(size=(num_factors, num_features))
    epoch_loss = []
    for epoch in range(epochs):
        loss, global_bias = sgd(X, y, num_samples, num_features,
                                global_bias, weights,
                                feature_factors, num_factors,
                                learning_rate, reg_weights, reg_features)
        print(f'[epoch: {epoch+1}], loss: {loss}')
        epoch_loss.append(loss)
    return epoch_loss
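The `0.5 * (summed[factor]**2 - summed_squared[factor])` line in `predict` relies on the identity Σ_{i<j} a_i a_j = ½[(Σ a_i)² − Σ a_i²], which reduces the pairwise interactions to linear time. A quick numeric check with made-up factor values:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)  # a_i plays the role of v[factor, i] * x_i for one factor

# Naive O(n^2) sum over all pairs i < j
naive = sum(a[i] * a[j] for i in range(5) for j in range(i + 1, 5))

# Linear-time reformulation used in predict()
fast = 0.5 * (a.sum() ** 2 - (a ** 2).sum())

print(np.isclose(naive, fast))  # True
```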
Hyperparameter settings
config = {
"num_epochs": 10,
"num_factors": 10,
"learning_rate": 0.1,
"reg_weights": 0.01,
"reg_features": 0.01
}
Training
epoch_loss = fit(X_train_sparse, y_train.values, config)
Visualization
plt.plot(epoch_loss)
plt.title('Loss per epoch')
plt.show()