[Recommender Systems (RS)] Factorization Machine

코딩걸음마 2022. 7. 19. 23:53

A Factorization Machine (FM) is a model that combines the strengths of SVMs and factorization models.

It is a general predictor that works with any real-valued feature vector.

The Factorization Machine model equation can be computed in linear time.
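
For reference, the FM model equation from Rendle (2010), and the reformulation that lets the pairwise interaction term be computed in linear time; this identity is exactly what the predict function below exploits:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j$$

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^{2} - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2 \right]$$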

Conventional recommender models require specialized input data and dedicated optimization algorithms.

In contrast, a Factorization Machine can be applied easily in almost any setting.

It also addresses the key weakness of SVMs, which perform very poorly on sparse data: FMs work well even under extreme sparsity.

 

Key concept: one-hot-encode the sparse interaction data, then train on the resulting feature vectors.
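
As a toy illustration of this concept (hypothetical user/movie IDs, not from the dataset), a handful of interactions become sparse one-hot feature rows:

import pandas as pd

# two hypothetical interactions -> one-hot user block + one-hot movie block
toy = pd.DataFrame({'user': ['A', 'B'], 'movie': ['m1', 'm2']})
X_toy = pd.concat([pd.get_dummies(toy['user'], prefix='user'),
                   pd.get_dummies(toy['movie'], prefix='movie')], axis=1)
print(X_toy)  # each row: [user_A, user_B, movie_m1, movie_m2]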

 

 

import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import scipy.sparse  # explicitly load the sparse subpackage used below
import math
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

 

1. Loading the data and splitting the train/test sets

We use the KMRD dataset:

https://github.com/lovit/kmrd 

 


data_path = "파일경로"  # placeholder: set this to your local kmrd path

if not os.path.exists(data_path):
  !git clone https://github.com/lovit/kmrd $data_path
  %cd $data_path
  !python setup.py install
else:
  print("data and path already exist!")
  %cd $data_path

Set the data path first; if the repository is not there yet, the code clones it via git clone.

Data Loader

rates.csv

df = pd.read_csv(os.path.join(data_path, 'rates.csv'))
train_df, val_df = train_test_split(df, test_size=0.2, random_state=1234, shuffle=True)
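
To confirm the 80/20 split worked, a quick check (the rates table contains user, movie, and rate columns):

# quick sanity check on the split
print(train_df.shape, val_df.shape)
print(train_df.head())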

movie dataframe

# Load all related dataframes
movies_df = pd.read_csv(os.path.join(data_path, 'movies.txt'), sep='\t', encoding='utf-8')
movies_df = movies_df.set_index('movie')

castings_df = pd.read_csv(os.path.join(data_path, 'castings.csv'), encoding='utf-8')
countries_df = pd.read_csv(os.path.join(data_path, 'countries.csv'), encoding='utf-8')
genres_df = pd.read_csv(os.path.join(data_path, 'genres.csv'), encoding='utf-8')

# Get genre information: one '/'-joined string per movie
genres = [(movie, '/'.join(group['genre'].values)) for movie, group in genres_df.groupby('movie')]
combined_genres_df = pd.DataFrame(data=genres, columns=['movie', 'genres'])
combined_genres_df = combined_genres_df.set_index('movie')

# Get castings information: one array of people per movie
castings = [(movie, group['people'].values) for movie, group in castings_df.groupby('movie')]
combined_castings_df = pd.DataFrame(data=castings, columns=['movie', 'people'])
combined_castings_df = combined_castings_df.set_index('movie')

# Get countries information: one comma-joined string per movie
countries = [(movie, ','.join(group['country'].values)) for movie, group in countries_df.groupby('movie')]
combined_countries_df = pd.DataFrame(data=countries, columns=['movie', 'country'])
combined_countries_df = combined_countries_df.set_index('movie')

movies_df = pd.concat([movies_df, combined_genres_df, combined_castings_df, combined_countries_df], axis=1)

print(movies_df.shape)
print(movies_df.head())

  • A factorization machine consumes one feature vector per interaction.
  • feature vector: user one-hot vector + item one-hot vector + meta information + other feature-engineered vectors
# genre onehot vector
dummy_genres_df = movies_df['genres'].str.get_dummies(sep='/')
dummy_genres_df.head()
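
The str.get_dummies(sep='/') call splits each '/'-joined genre string into its own indicator column. A toy example with made-up genre strings:

# how str.get_dummies splits '/'-joined strings (toy input, for illustration)
s = pd.Series(['드라마/로맨스', '액션'])
print(s.str.get_dummies(sep='/'))
#    드라마  로맨스  액션
# 0    1     1    0
# 1    0     0    1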

 

movies_df['grade'].unique()

Out

array(['전체 관람가', '12세 관람가', 'PG', '15세 관람가', 'NR', '청소년 관람불가', 'PG-13',
       'R', 'G', nan], dtype=object)

 

dummy_grade_df = pd.get_dummies(movies_df['grade'], prefix='grade')
dummy_grade_df.head()

 

Convert To Factorization Machine format 

  • Represent each user as a one-hot vector
  • Represent each item as a one-hot vector
  • Build categorical features from movies_df

# map each interaction's movie id to that movie's genre one-hot row
train_df['movie'].apply(lambda x: dummy_genres_df.loc[x])

 

Genre one-hot vectors

 

# peek at the user one-hot block
test_df = pd.get_dummies(train_df['user'], prefix='user')
test_df.head()

# peek at the movie one-hot block
test_df = pd.get_dummies(train_df['movie'], prefix='movie')

 

 

X_train = pd.concat([pd.get_dummies(train_df['user'], prefix='user'),
           pd.get_dummies(train_df['movie'], prefix='movie'),
           train_df['movie'].apply(lambda x: dummy_genres_df.loc[x]),
           train_df['movie'].apply(lambda x: dummy_grade_df.loc[x])], axis=1)
# ratings skew high, so treat only a 10 as positive (1) and the rest as negative (-1)
# use -1 rather than 0 for the negative class so the log loss below is well-defined
y_train = train_df['rate'].apply(lambda x: 1 if x > 9 else -1)
print(X_train.shape)
print(y_train.shape)

# a CSR matrix stores only the nonzero entries, which is what SGD iterates over
# csr_matrix reference: https://rfriend.tistory.com/551
X_train_sparse = scipy.sparse.csr_matrix(X_train.values)
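
Since the training loop below walks the CSR internals directly, it helps to see how data, indices, and indptr relate on a tiny matrix:

# tiny CSR example (illustration only)
M = scipy.sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 0, 3]]))
print(M.data)     # [1 2 3]  nonzero values
print(M.indices)  # [0 2 2]  column index of each nonzero value
print(M.indptr)   # [0 2 3]  row i spans data[indptr[i]:indptr[i+1]]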

 

 

Train Factorization Machine

# Compute negative log likelihood between prediction and label (y in {-1, 1})
def log_loss(pred, y):
    return np.log(np.exp(-pred * y) + 1.0)

# One SGD pass over all samples; w and v are updated in place,
# and the scalar bias w0 is returned alongside the mean loss
def sgd(X, y, n_samples, n_features,
        w0, w, v, n_factors, learning_rate, reg_w, reg_v):
    data = X.data
    indptr = X.indptr
    indices = X.indices
    loss = 0.0

    for i in range(n_samples):
        pred, summed = predict(X, w0, w, v, n_factors, i)

        # calculate loss and its gradient
        loss += log_loss(pred, y[i])
        loss_gradient = -y[i] / (np.exp(y[i] * pred) + 1.0)

        # update bias/intercept term
        w0 -= learning_rate * loss_gradient

        # update linear weights
        for index in range(indptr[i], indptr[i + 1]):
            feature = indices[index]
            w[feature] -= learning_rate * (loss_gradient * data[index] + 2 * reg_w * w[feature])

        # update latent factors
        for factor in range(n_factors):
            for index in range(indptr[i], indptr[i + 1]):
                feature = indices[index]
                term = summed[factor] - v[factor, feature] * data[index]
                v_gradient = loss_gradient * data[index] * term
                v[factor, feature] -= learning_rate * (v_gradient + 2 * reg_v * v[factor, feature])

    loss /= n_samples
    # w0 is a plain float (passed by value), so return it explicitly;
    # otherwise the bias update would be lost between epochs
    return loss, w0
def predict(X, w0, w, v, n_factors, i):
    """Predict a single instance i of the CSR matrix X."""
    data = X.data
    indptr = X.indptr
    indices = X.indices
    summed = np.zeros(n_factors)
    summed_squared = np.zeros(n_factors)

    # linear part: w0 + w * x
    pred = w0
    for index in range(indptr[i], indptr[i + 1]):
        feature = indices[index]
        pred += w[feature] * data[index]

    # pairwise part, computed in linear time per factor f:
    # 0.5 * ((sum_j v_jf x_j)^2 - sum_j (v_jf x_j)^2)
    for factor in range(n_factors):
        for index in range(indptr[i], indptr[i + 1]):
            feature = indices[index]
            term = v[factor, feature] * data[index]
            summed[factor] += term
            summed_squared[factor] += term * term

        pred += 0.5 * (summed[factor] * summed[factor] - summed_squared[factor])

    # summed is needed again in the gradient update, so return it for re-use
    return pred, summed
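
A quick numeric check of the linear-time identity used above, for a single factor (made-up values):

# verify 0.5 * ((sum vx)^2 - sum (vx)^2) equals the explicit pairwise sum
vx = np.array([0.5, -1.0, 2.0])  # v[f, j] * x[j] for the nonzero features
pairwise = sum(vx[i] * vx[j] for i in range(3) for j in range(i + 1, 3))
fast = 0.5 * (vx.sum() ** 2 - (vx ** 2).sum())
print(np.isclose(pairwise, fast))  # True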

Implementing fit

# Train Factorization Machine
# X -> sparse csr_matrix, y -> label
def fit(X, y, config):
    epochs = config['num_epochs']
    num_factors = config['num_factors']
    learning_rate = config['learning_rate']
    reg_weights = config['reg_weights']
    reg_features = config['reg_features']

    num_samples, num_features = X.shape
    weights = np.zeros(num_features) # -> w
    global_bias = 0.0 # -> w0
    
    # latent factors for all features -> v
    feature_factors = np.random.normal(size = (num_factors, num_features))

    epoch_loss = []
    for epoch in range(epochs):
        loss, global_bias = sgd(X, y, num_samples, num_features,
                                global_bias, weights,
                                feature_factors, num_factors,
                                learning_rate, reg_weights, reg_features)
        print(f'[epoch: {epoch+1}], loss: {loss}')

        epoch_loss.append(loss)
      
    return epoch_loss

Hyperparameter settings

config = {
    "num_epochs": 10,
    "num_factors": 10,
    "learning_rate": 0.1,
    "reg_weights": 0.01,
    "reg_features": 0.01
}

 

Training

epoch_loss = fit(X_train_sparse, y_train.values, config)
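
Note that fit as written returns only the loss history. To actually score interactions you also need the learned parameters; a sketch, assuming fit is modified to return global_bias, weights, and feature_factors as well:

# hypothetical: fit is changed to return the learned parameters too
# epoch_loss, w0, w, v = fit(X_train_sparse, y_train.values, config)
# pred, _ = predict(X_train_sparse, w0, w, v, config['num_factors'], i=0)
# prob = 1.0 / (1.0 + np.exp(-pred))  # sigmoid -> P(rating == 10)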

Visualization

# pyplot was already imported at the top as plt
plt.plot(epoch_loss)
plt.title('Loss per epoch')
plt.show()