1. Introduction

Imagine you are on the subway on your way to work. You feel jaded because the commute to your company takes a whole hour, so you take out your smartphone and open the web browser to read some news.

The browser always shows some news on its first page. There may be a piece of recent shocking news that catches your interest, or the whole first page may be boring and leave you feeling even more fatigued on the subway.

“Users’ clicks on our page are the only measure of our efforts!”

All internet companies want users to use their services and to keep users’ eyes on those services for as long as they can. To gather more users, these companies try every method to attract them. One of the most useful technologies is the recommendation system.

The algorithm learns from your reading history and gives you recommendations. In the beginning, the recommendations may be strange and not what you want at all, but as time goes by, more efficient algorithms come out and more user information is gathered, so the recommendation system becomes smarter and more humanized; the algorithm may end up more familiar with you than you are with yourself.

Let me show you the wonderful world of recommendation systems!

2. Recommendation system

A recommendation system is an application of machine learning that learns users' preferences by looking at their histories. As time goes by, these systems keep performing better, because more user data accumulates.

Nowadays, people live in a world of information and electricity. Hardly any city dweller can live without a smartphone. With smart devices, people arrange their daily lives with the help of the algorithms inside those devices. As shown in figure 1, this ranges from finding a good canteen for lunch, to watching an interesting movie for rest, to finding the soul mate for the rest of your life...


Figure 1. People surrounded by recommendation systems

Recommendation systems are also meaningful for businesses, as using them can generate much higher income than before. For example, ads are only worthwhile when they reach their target audience.

“It's meaningless to persuade singles to buy strollers!”

Traditionally, collaborative filtering methods have been the major methods used in recommendation systems. Whatever the items are (movies, articles, products), collaborative filtering methods perform adequately.

Collaborative methods for recommender systems are methods that are “based only on the past interactions recorded between users and items to produce new recommendations” [1]. These interactions are stored in the so-called “user-item interactions matrix” as shown in figure 2.


Figure 2. Illustration of the user-item interactions matrix [2].

For collaborative filtering, the two most widely used methods are the user-user method and the item-item method (Figure 3).

Here is a simple explanation of the two methods:

  • user-user method: because we both like those products, we are similar to each other; I also like that other product, so you should like it, too.
  • item-item method: people who like this product tend to also like that product; because you like this product, you should like that product, too.

Figure 3. Illustration of the difference between item-item and user-user methods.
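The two methods above can be sketched in a few lines of numpy. This is a minimal illustration on a toy interaction matrix, not the implementation used by any particular service; plain cosine similarity is assumed as the similarity measure.

```python
import numpy as np

# Toy user-item interaction matrix: rows = users, columns = items,
# 1 = the user clicked/liked the item, 0 = no recorded interaction.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard against all-zero rows
    U = M / norms
    return U @ U.T

def user_user_scores(R, u):
    """Score items for user u by the similarity-weighted votes of other users."""
    S = cosine_sim(R)
    S[u, u] = 0.0                    # ignore self-similarity
    return S[u] @ R

def item_item_scores(R, u):
    """Score items by their similarity to the items user u already liked."""
    S = cosine_sim(R.T)
    return R[u] @ S

# Both methods score the unseen item 2 above the unseen item 3 for user 0,
# because a similar user (user 1) liked it / it is similar to liked items.
print(user_user_scores(R, 0))
print(item_item_scores(R, 0))
```

In practice the interaction matrix is huge and extremely sparse, so real systems use approximate nearest-neighbor search or matrix factorization instead of dense similarity matrices; the logic, however, is the same.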

Nowadays, as machine learning algorithms evolve, more and more other useful algorithms have arrived and work nicely with recommendation systems. Though there are many subfields in which recommendation systems are very useful, the news recommendation problem is always one of the most attractive.

3. News Recommendation Problem

Again, you are on the subway to your company. Since your time is limited, it's impossible for you to read all the news, so you choose a news service that helps you pick out the recent hot news and other news you may be interested in. Traditionally, you could buy a printed newspaper, where the news is selected by human editors. Using a smartphone, however, presents the news service company with your personal information: your age, your sex, your address, the news you recently read, the time you spent on each article... In this way, each person can be recognized as a single user with his or her own preferences. With that information, the news service can present you with more personalized news.

Figure 4 shows Yahoo! JAPAN's homepage on smartphones. In the personalized news recommendation module, many news items can be found. By tapping one of these tabs, the user is redirected to the news description page. This layout is the general form of news services all over the world; the only differences are the header and the language. I think all of you are familiar with this smartphone presentation of news recommendation.


Figure 4. Example of Yahoo! JAPAN’s homepage on smart-phones.

By storing user ID cookies, the news service can recognize a user and personalize articles for individual users. However, ID-based methods such as collaborative filtering were found to be unsuitable for news recommendation, as Okura et al. report [3]:

“Candidate news articles expired too quickly and were replaced with new ones within short periods"

Thus, the three keys in news recommendations are:

  • Understanding the content of articles,

  • Understanding user preferences, and

  • Listing selected articles for individual users based on content and preferences.

Based on these key points, Okura et al. developed the previous version of their news recommendation system: "An article is regarded as a collection of words included in its text. A user is regarded as a collection of words included in articles he/she has browsed" [3]. They then used the user-user method and the item-item method for news recommendation.

But this model has two major drawbacks:

  • The representation of words: two words with the same meaning were treated as completely different features if their notations differed.
  • The handling of browsing histories: a history should be handled as a time series instead of a bag of articles, because the change of interests over time is significant for recommendations.

To resolve these problems, new methods are presented:

  • Start with distributed representations of articles based on a variant of the denoising autoencoder
  • Generate user representations by using an RNN with browsing histories as input sequences

4. News Recommendation Process Flow

To make the news recommendation system work, five steps need to be executed to select articles for millions of users [3]:

  • Identify: Obtain user features calculated from user history in advance.
  • Matching: Extract articles from all those available using user features.
  • Ranking: Rearrange list of articles on certain priorities.
  • De-duplication: Remove articles that contain the same information as others.
  • Advertising: Insert ads if necessary.
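The five steps above can be sketched as a toy pipeline. Everything here is hypothetical: the function names, the similarity threshold, and the data shapes are illustrative, not taken from the paper; the "Identify" step is represented by the precomputed `user` vector passed in.

```python
import numpy as np

def score(user_vec, art_vec):
    """Affinity between a user vector and an article vector."""
    return float(np.dot(user_vec, art_vec))

def too_similar(a, b, thr=0.9):
    """De-duplication test: cosine similarity of two article vectors above thr."""
    na, nb = np.linalg.norm(a["vec"]), np.linalg.norm(b["vec"])
    return float(np.dot(a["vec"], b["vec"]) / (na * nb)) > thr

def recommend(user_vec, articles, ads=(), k=5):
    # Matching: keep articles with positive affinity to the user
    candidates = [a for a in articles if score(user_vec, a["vec"]) > 0]
    # Ranking: order candidates by score, highest first
    ranked = sorted(candidates, key=lambda a: score(user_vec, a["vec"]), reverse=True)
    # De-duplication: skip articles nearly identical to ones already listed
    result = []
    for a in ranked:
        if not any(too_similar(a, b) for b in result):
            result.append(a)
    # Advertising: insert an ad after the second article if one is available
    if ads and len(result) >= 2:
        result.insert(2, ads[0])
    return result[:k]

articles = [
    {"id": "a1", "vec": np.array([1.0, 0.0])},
    {"id": "a2", "vec": np.array([0.9, 0.1])},   # near-duplicate of a1
    {"id": "a3", "vec": np.array([0.2, 1.0])},
    {"id": "a4", "vec": np.array([-1.0, 0.0])},  # negative affinity, filtered out
]
user = np.array([1.0, 0.5])
print([a["id"] for a in recommend(user, articles)])  # a2 and a4 are dropped
```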

By analyzing this process, we find that article representation and user representation are the two major tasks to be solved. Only if we can represent articles and users as lists of features can we match them together.

Article Representation

Previous studies used words as features for an article, which did not work well in certain cases of extraction and de-duplication. Okura et al. describe a new method that deals with articles as distributed representations [3].

As you know, in machine learning the loss function is the key factor that decides how good a model can become. This paper shows a new way to define the loss function for article representations. As shown in figure 5, for each base article there are multiple other articles, which may be similar to or different from it. Humans chose which articles belong to the same category, such as history, politics, international, military...

The encoding process works like "compressing" a file to a smaller size, and the decoding process works like "decompressing" a small representation back into a full article. If you are not familiar with encoding and decoding in machine learning, you can simply think of the encoded vector as the features of the article.


Figure 5: Encoder for triplets of articles

With that knowledge, you can consider the loss function: articles in the same category should have a larger similarity than articles in different categories. The loss is therefore the similarity of the different-category pair minus the similarity of the same-category pair. By minimizing this loss, we can find a good representation for encoded articles.
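This triplet idea can be written down directly. The sketch below assumes a stand-in encoder (one linear layer with tanh) purely for illustration; the paper's actual encoder is a variant of a denoising autoencoder with its own parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Stand-in encoder: one linear layer + tanh. Illustrative only; the
    paper uses a variant of a denoising autoencoder instead."""
    return np.tanh(x @ W)

def triplet_loss(base, same, diff, W):
    """Similarity of the different-category pair minus similarity of the
    same-category pair; minimizing it pushes same-category articles closer
    together than different-category ones."""
    h0, hs, hd = encode(base, W), encode(same, W), encode(diff, W)
    return float(np.dot(h0, hd) - np.dot(h0, hs))

d_in, d_out = 8, 4
W = rng.normal(size=(d_in, d_out))
base = rng.normal(size=d_in)
# An identical same-category article and an opposite different-category one
# already give a negative (i.e. good) loss, even before any training.
print(triplet_loss(base, base, -base, W))
```

In training, this loss would be minimized over many human-labeled triplets by updating the encoder weights with gradient descent.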

User Representation

The user is more complex than the article. Because users are active and keep clicking the news they are interested in, each user accumulates a long reading history (Figure 6). The user representation needs to be able to represent this reading history.


Figure 6: Browsing history and session

To embed a user's reading history into a fixed-size list of numbers, there are many different techniques. The easiest is the word-based model, which can be set up with an average function instead of the max, or with a decayed average function. A decaying model is also a good way to generate user representations: it uses a weighted average to aggregate browsing histories instead of the maximum value.
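The decaying model's weighted average is simple to write down. This is a sketch of the idea only, assuming each browsed article is already an embedding vector; the exact weighting in the paper may differ in details.

```python
import numpy as np

def decayed_average(history, beta=0.9):
    """Aggregate browsed-article vectors with exponentially decaying weights:
    the most recent article gets weight 1, the one before it beta, then
    beta^2, and so on. beta=1 recovers a plain average."""
    history = np.asarray(history, dtype=float)
    n = len(history)
    # oldest -> newest weights: beta^(n-1), ..., beta^1, beta^0
    w = beta ** np.arange(n - 1, -1, -1)
    return (w[:, None] * history).sum(axis=0) / w.sum()

# With beta=0.5 the newer article [0, 1] dominates the older [1, 0]:
print(decayed_average([[1, 0], [0, 1]], beta=0.5))  # [1/3, 2/3]
```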

But here we focus on the recurrent models. Long short-term memory (LSTM) is a well-known structure designed to cope with vanishing and exploding gradients. Figure 7 shows the network structure of the LSTM-based model.


Figure 7: LSTM-based model

A list of articles from the user's reading history is gathered as input for the LSTM units, and by passing this reading history through the network, the user's features can be extracted as the user representation.
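To make the idea concrete, here is a minimal single-layer LSTM forward pass in numpy that consumes a sequence of article vectors and returns the final hidden state as the user representation. The gate names are the standard LSTM ones; the paper's exact parameterization and training objective are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_user_representation(history, params):
    """Run one LSTM layer over the browsing history (a sequence of article
    vectors) and return the final hidden state as the user representation."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    hidden = bf.shape[0]
    h = np.zeros(hidden)                      # hidden state
    c = np.zeros(hidden)                      # cell state
    for x in history:
        z = np.concatenate([x, h])            # input + previous hidden state
        f = sigmoid(Wf @ z + bf)              # forget gate
        i = sigmoid(Wi @ z + bi)              # input gate
        o = sigmoid(Wo @ z + bo)              # output gate
        c = f * c + i * np.tanh(Wc @ z + bc)  # cell state update
        h = o * np.tanh(c)                    # new hidden state
    return h

rng = np.random.default_rng(1)
d_in, d_h = 6, 4
weights = [rng.normal(scale=0.5, size=(d_h, d_in + d_h)) for _ in range(4)]
biases = [np.zeros(d_h) for _ in range(4)]
params = (*weights, *biases)
history = rng.normal(size=(5, d_in))          # five browsed articles
u = lstm_user_representation(history, params)
print(u.shape)
```

In the deployed system the LSTM weights are of course learned from click data rather than drawn at random; this sketch only shows how the sequence is folded into one fixed-size vector.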

5. Offline Test and Deployment

Offline Test

Models and Training

About 12 million users who had clicked at least one article were sampled as the dataset, containing about 166 million sessions, one billion browses, and two million unique articles. By applying the recommendation systems to this dataset, we can run the offline test and compare the results of different models.

This paper uses three popular metrics, i.e., the area under the ROC curve (AUC), mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG), all of which regard clicks as positive labels. For all three metrics, larger is better. \[ \begin{aligned} \mathrm{AUC} &=\frac{1}{|S|} \sum_{s \in S} \frac{\left|\left\{(i, j) \mid i<j, c_{s, i}=1, c_{s, j}=0\right\}\right|}{\left|\left\{i \mid c_{s, i}=1\right\}\right|\left|\left\{j \mid c_{s, j}=0\right\}\right|} \\ \mathrm{MRR} &=\frac{1}{|S|} \sum_{s \in S} \frac{1}{\min _{c_{s, i}=1} i} \\ \mathrm{nDCG} &=\frac{1}{|S|} \sum_{s \in S} \frac{\sum_{i} c_{s, i} / \log _{2}(i+1)}{\max _{\pi} \sum_{i} c_{s, \pi(i)} / \log _{2}(\pi(i)+1)} \end{aligned} \] AUC is the metric most directly related to the training objective. MRR and nDCG are popular ranking indicators; MRR focuses on the first appearance of a positive instance, while nDCG evaluates the ranks of all positive instances.
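Under the definitions above, the three metrics for a single session can be computed directly from the click labels of the displayed articles. This sketch assumes each session has at least one clicked and one unclicked article (the degenerate cases need separate handling):

```python
import numpy as np

def session_metrics(clicks):
    """clicks[i] = 1 if the article shown at rank i+1 was clicked, else 0.
    Computes the per-session AUC, MRR and nDCG from the formulas above
    (ranks are 1-indexed, as in the equations)."""
    clicks = np.asarray(clicks)
    pos = np.where(clicks == 1)[0]
    neg = np.where(clicks == 0)[0]
    # AUC: fraction of (positive, negative) pairs ranked in the right order
    concordant = sum(1 for i in pos for j in neg if i < j)
    auc = concordant / (len(pos) * len(neg))
    # MRR: reciprocal rank of the first clicked article
    mrr = 1.0 / (pos[0] + 1)
    # nDCG: discounted gain of the clicks, normalized by the ideal ordering
    ranks = np.arange(1, len(clicks) + 1)
    dcg = (clicks / np.log2(ranks + 1)).sum()
    ideal = (np.sort(clicks)[::-1] / np.log2(ranks + 1)).sum()
    ndcg = dcg / ideal
    return auc, mrr, ndcg

# Clicks at ranks 1 and 3 out of 4 displayed articles:
print(session_metrics([1, 0, 1, 0]))
```

The paper's numbers are the averages of these per-session values over the whole test set of sessions \(S\).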

Experimental results

This paper evaluates the seven models listed in Table 1. \[ \begin{array}{c|l} \hline \text { Name } & \text { Description } \\ \hline \hline \text { BoW } & \text { The simplest word-based model } \\ \hline \text { BoW-Ave } & \begin{array}{l} \text { Word-based model that uses average func- } \\ \text { tion instead of max in Eq.4 } \end{array} \\ \hline \text { BoW-Dec } & \begin{array}{l} \text { Word-based model that uses decayed aver- } \\ \text { age function with } \beta=0.9 \text { similar to that } \\ \text { introduced in Section 4.3 } \end{array} \\ \hline \text { Average } & \text { Decaying model with } \beta=1 \text { (no decaying) } \\ \hline \text { Decay } & \text { Decaying model with } \beta=0.9 \\ \hline \text { RNN } & \text { Recurrent model using simple RNN unit } \\ \hline \text { LSTM } & \text { Recurrent model using LSTM-based unit }\\ \hline \end{array} \] Table 2 shows the results of the offline experiments. Values indicate the averages of the metrics and 99% confidence intervals over ten split test sets. As noted above, for AUC, MRR, and nDCG, larger is better. \[ \begin{array}{c|ccc} \hline & \text { AUC } & \text { MRR } & \text { nDCG } \\ \hline \hline \text { BoW } & 0.582 \pm 0.003 & 0.300 \pm 0.003 & 0.446 \pm 0.002 \\ \text { BoW-Ave } & 0.579 \pm 0.004 & 0.310 \pm 0.003 & 0.452 \pm 0.002 \\ \text { BoW-Dec } & 0.560 \pm 0.004 & 0.297 \pm 0.004 & 0.442 \pm 0.003 \\ \hline \text { Average } & 0.608 \pm 0.003 & 0.313 \pm 0.003 & 0.457 \pm 0.002 \\ \text { Decay } & 0.596 \pm 0.003 & 0.302 \pm 0.002 & 0.449 \pm 0.001 \\ \hline \text { RNN } & 0.612 \pm 0.004 & 0.309 \pm 0.004 & 0.455 \pm 0.003 \\ \text { LSTM } & 0.648 \pm 0.004 & 0.344 \pm 0.004 & 0.481 \pm 0.003 \\ \hline \end{array} \] LSTM is significantly better than Average. This is because its gate structures can express more complex relations in the order of browsing sequences.

Deployment

Experimental Results

Yahoo! JAPAN began using this new recommendation system for its news service in December 2016. The paper uses four online metrics:

  • Sessions: The average number of times one user utilized our service per day.
  • Duration: The average time (in seconds) that the user spent with our service per session.
  • Clicks: The average number of clicks per session.
  • Click-through rate (CTR): Clicks / number of displayed articles.
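The improvements plotted in Figure 8 are lift rates, i.e., the relative change of a metric in the test bucket versus the control bucket. The numbers below are made up purely for illustration; they are not from the paper.

```python
def lift_rate(test_value, control_value):
    """Relative improvement of the test bucket over the control bucket;
    e.g. 0.25 means the metric is 25% higher with the new model."""
    return test_value / control_value - 1.0

# Illustrative numbers only: CTR = clicks / number of displayed articles
ctr_control = 120 / 4000
ctr_test = 150 / 4000
print(f"CTR lift: {lift_rate(ctr_test, ctr_control):+.1%}")
```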

Figure 8 plots the daily transition of the lift rate for each metric. As you can see, all metrics improved significantly (pay attention to the y-axis). Though the improvement rate of sessions is smaller than that of the other three metrics, it keeps increasing as the number of days increases.


Figure 8: Transition in lift rate for metrics.

Users can also be split into three types by their frequency of visits:

  • Heavy: >5 days/week.
  • Medium: 2–5 days/week.
  • Light: <2 days/week.

Table 3 summarizes the average improvement rates of the metrics in the seventh week, overall and broken down by user type. \[ \begin{array}{l|r|rrr} \hline \text { Metric } & \text { ALL } & \text { Heavy } & \text { Medium } & \text { Light } \\ \hline \hline \text { Sessions } & +2.3 \% & +1.1 \% & +1.0 \% & +1.8 \% \\ \text { Duration } & +7.8 \% & +4.9 \% & +13.3 \% & +17.4 \% \\ \text { Clicks } & +19.1 \% & +14.3 \% & +26.3 \% & +42.3 \% \\ \text { CTR } & +23.0 \% & +18.7 \% & +29.8 \% & +45.1 \% \\ \hline \end{array} \]

“Improvements in the metrics were confirmed for all segments. Light users demonstrated particularly large improvement rates!”

Challenges for large-scale deployment

Though the results of the proposed machine learning news recommendation model seem good, there are still many challenges that need to be solved.

First, training a large model takes a long time, since all the weights must be updated. Because the model is so large, training may take more than one week.

Second, the article representations can be improved as more data is collected. But after updating the representation model, all previously calculated article representations need to be recalculated, which costs even more time.

Third, whenever a user's reading history changes, that user's representation needs to be recalculated, and such changes happen all the time, which means the user representations need to be recalculated constantly.

Finally, the updates of the article representations and the user representations need to happen simultaneously, or the two sides of the system will not match.

One lazy solution is to generate a new model every week and switch over to it directly. But this is not a good solution, because it cannot keep up with changes in users' interests.

6. Summary

This paper presents a novel news recommendation system that is a large advance over its predecessors. Though it has some drawbacks, it is still a big step in this field of machine learning. More and more people have stopped reading newspapers because they believe online news services work better than old-fashioned newspapers with human editors. But recommendation systems may amplify bias from users' reading histories, which cannot be ignored when designing them.

Though recommendation systems still have many drawbacks, the domain is developing fast, and these challenges will be dealt with in the future. If this field is attractive to you, you should go deeper and build a full knowledge base on recommendation systems. If you need some help, feel free to contact me at bh2283@nyu.edu. I would also be exhilarated to receive feedback from you on drafting easy-to-understand blog articles.

References

[1] F. O. Isinkaye, Y. O. Folajimi, and B. A. Ojokoh, “Recommendation systems: Principles, methods and evaluation,” Egyptian Informatics Journal, vol. 16, no. 3, pp. 261–273, Nov. 2015, doi: 10.1016/j.eij.2015.06.005.

[2] B. Rocca, “Introduction to recommender systems,” Medium, Jun. 12, 2019. https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada (accessed Mar. 28, 2021).

[3] S. Okura, Y. Tagami, S. Ono, and A. Tajima, “Embedding-based News Recommendation for Millions of Users,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, Aug. 2017, pp. 1933–1942, doi: 10.1145/3097983.3098108.