SPARK代写|JAVA代写|作业代写|数据挖掘|推荐系统

Posted on February 26, 2020 by mac

Spotify Music Recommendation System by XXX¶
I desing a Spotify Muisc Recommendation System
• Load and display the data
• DATA Analysis and Processing
• Build Popularity-based Computing model and Machine Learning model —— ALS
▪ ‘prediction’ is the number of unique customers that have listened to the same track
▪ After cross the dataset randomly we generate a recommendation model to recommend spotiy music
▪ Recommendation Model From Explicit Rating

1. Load and display the data¶
In [6]:
!pip install ipython-sql

Collecting ipython-sql
Downloading https://files.pythonhosted.org/packages/ab/df/427e7cf05ffc67e78672ad57dce2436c1e825129033effe6fcaf804d0c60/ipython_sql-0.3.9-py2.py3-none-any.whl
Requirement already satisfied: ipython-genutils>=0.1.0 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython-sql) (0.2.0)
Collecting prettytable (from ipython-sql)
Downloading https://files.pythonhosted.org/packages/ef/30/4b0746848746ed5941f052479e7c23d2b56d174b82f4fd34a25e389831f5/prettytable-0.7.2.tar.bz2
Requirement already satisfied: ipython>=1.0 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython-sql) (7.8.0)
Requirement already satisfied: sqlalchemy>=0.6.7 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython-sql) (1.3.9)
Requirement already satisfied: six in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython-sql) (1.13.0)
Collecting sqlparse (from ipython-sql)
Downloading https://files.pythonhosted.org/packages/ef/53/900f7d2a54557c6a37886585a91336520e5539e3ae2423ff1102daf4f3a7/sqlparse-0.3.0-py2.py3-none-any.whl
Requirement already satisfied: decorator in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (4.4.0)
Requirement already satisfied: backcall in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (0.1.0)
Requirement already satisfied: pickleshare in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (0.7.5)
Requirement already satisfied: appnope; sys_platform == “darwin” in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (0.1.0)
Requirement already satisfied: pexpect; sys_platform != “win32” in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (4.7.0)
Requirement already satisfied: traitlets>=4.2 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (4.3.3)
Requirement already satisfied: jedi>=0.10 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (0.15.1)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (2.0.10)
Requirement already satisfied: pygments in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (2.4.2)
Requirement already satisfied: setuptools>=18.5 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (41.4.0)
Requirement already satisfied: ptyprocess>=0.5 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from pexpect; sys_platform != “win32”->ipython>=1.0->ipython-sql) (0.6.0)
Requirement already satisfied: parso>=0.5.0 in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from jedi>=0.10->ipython>=1.0->ipython-sql) (0.5.1)
Requirement already satisfied: wcwidth in /Users/charles/opt/anaconda3/lib/python3.7/site-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython>=1.0->ipython-sql) (0.1.7)
Building wheels for collected packages: prettytable
Building wheel for prettytable (setup.py) … done
Created wheel for prettytable: filename=prettytable-0.7.2-cp37-none-any.whl size=13700 sha256=b762a142ce99915d70b32db5de42610976c9d0ab2ed41b5308b1fa8c1a1407c0
Stored in directory: /Users/charles/Library/Caches/pip/wheels/80/34/1c/3967380d9676d162cb59513bd9dc862d0584e045a162095606
Successfully built prettytable
Installing collected packages: prettytable, sqlparse, ipython-sql
Successfully installed ipython-sql-0.3.9 prettytable-0.7.2 sqlparse-0.3.0
In [2]:
#!pip install pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
In [3]:
musicDataFrame = sqlContext.read.format(“csv”).options(header=’true’, inferSchema=’true’).load(“./music.csv”)
In [4]:
musicDataFrame.show(5)

+——-+——————–+———–+——+
|TrackId| Title| Artist|Length|
+——-+——————–+———–+——+
| 0| Caught Up In You|.38 Special| 200|
| 1| Fantasy Girl|.38 Special| 219|
| 2| Hold On Loosely|.38 Special| 253|
| 3|Hold On Loosely …|.38 Special| 154|
| 4| Art For Arts Sake| 10cc| 341|
+——-+——————–+———–+——+
only showing top 5 rows

In [5]:
trackDataFrame = sqlContext.read.format(“csv”).options(header=’true’, inferSchema=’true’).load(“./tracks.csv”)
trackDataFrame.printSchema()

root
|– EventID: integer (nullable = true)
|– CustID: integer (nullable = true)
|– TrackId: integer (nullable = true)
|– DateTime: string (nullable = true)
|– Mobile: integer (nullable = true)
|– ZipCode: integer (nullable = true)

In [6]:
musicDataFrame.printSchema()
customerDataFrame = sqlContext.read.format(“csv”).options(header=’true’, inferSchema=’true’).load(“./cust.csv”)
customerDataFrame.printSchema()

root
|– TrackId: integer (nullable = true)
|– Title: string (nullable = true)
|– Artist: string (nullable = true)
|– Length: integer (nullable = true)

root
|– CustID: integer (nullable = true)
|– Name: string (nullable = true)
|– Gender: integer (nullable = true)
|– Address: string (nullable = true)
|– zip: integer (nullable = true)
|– SignDate: string (nullable = true)
|– Status: integer (nullable = true)
|– Level: integer (nullable = true)
|– Campaign: integer (nullable = true)
|– LinkedWithApps: integer (nullable = true)

In [7]:
customerDataFrame.printSchema()

root
|– CustID: integer (nullable = true)
|– Name: string (nullable = true)
|– Gender: integer (nullable = true)
|– Address: string (nullable = true)
|– zip: integer (nullable = true)
|– SignDate: string (nullable = true)
|– Status: integer (nullable = true)
|– Level: integer (nullable = true)
|– Campaign: integer (nullable = true)
|– LinkedWithApps: integer (nullable = true)

In [8]:
customerDataFrame.show(5)

+——+————-+——+——————–+—–+———-+——+—–+——–+————–+
|CustID| Name|Gender| Address| zip| SignDate|Status|Level|Campaign|LinkedWithApps|
+——+————-+——+——————–+—–+———-+——+—–+——–+————–+
| 0|Gregory Koval| 0|13004 Easy Cider …|72132|06/04/2013| 1| 1| 1| 0|
| 1|Robert Gordon| 0|10497 Thunder Hic…|17307|07/27/2013| 1| 1| 1| 0|
| 2|Paula Peltier| 0|10084 Easy Gate Bend|66216|01/13/2013| 1| 0| 4| 1|
| 3|Francine Gray| 0|54845 Bent Pony H…|36690|07/11/2013| 1| 1| 1| 1|
| 4| David Garcia| 0|8551 Tawny Fox Villa|61377|09/09/2012| 1| 0| 1| 1|
+——+————-+——+——————–+—–+———-+——+—–+——–+————–+
only showing top 5 rows

2. Analysis and Process data in csv file¶
In [13]:
from pyspark.sql.functions import col
rid = “TrackID”
Ls = “MTrackID”
MN= musicDataFrame.select(“*”).withColumnRenamed(rid,Ls)
MN.show(5)

+——–+——————–+———–+——+
|MTrackID| Title| Artist|Length|
+——–+——————–+———–+——+
| 0| Caught Up In You|.38 Special| 200|
| 1| Fantasy Girl|.38 Special| 219|
| 2| Hold On Loosely|.38 Special| 253|
| 3|Hold On Loosely …|.38 Special| 154|
| 4| Art For Arts Sake| 10cc| 341|
+——–+——————–+———–+——+
only showing top 5 rows

In [15]:
CN = customerDataFrame.select(“*”).withColumnRenamed(“CustID”,’CCustID’)
CN.show(5)
track_custJoin=trackDataFrame.join(CN, trackDataFrame.CustID == CN.CCustID, “left_outer”)
want_tocheck = “left_outer”
togeth_way = track_custJoin.join(MN,track_custJoin.TrackId == MN.MTrackID,want_tocheck)

+——-+————-+——+——————–+—–+———-+——+—–+——–+————–+
|CCustID| Name|Gender| Address| zip| SignDate|Status|Level|Campaign|LinkedWithApps|
+——-+————-+——+——————–+—–+———-+——+—–+——–+————–+
| 0|Gregory Koval| 0|13004 Easy Cider …|72132|06/04/2013| 1| 1| 1| 0|
| 1|Robert Gordon| 0|10497 Thunder Hic…|17307|07/27/2013| 1| 1| 1| 0|
| 2|Paula Peltier| 0|10084 Easy Gate Bend|66216|01/13/2013| 1| 0| 4| 1|
| 3|Francine Gray| 0|54845 Bent Pony H…|36690|07/11/2013| 1| 1| 1| 1|
| 4| David Garcia| 0|8551 Tawny Fox Villa|61377|09/09/2012| 1| 0| 1| 1|
+——-+————-+——+——————–+—–+———-+——+—–+——–+————–+
only showing top 5 rows

In [16]:
togeth_way.show(5)

+——-+——+——-+————–+——+——-+——-+—————+——+——————–+—–+———-+——+—–+——–+————–+——–+——————–+——————–+——+
|EventID|CustID|TrackId| DateTime|Mobile|ZipCode|CCustID| Name|Gender| Address| zip| SignDate|Status|Level|Campaign|LinkedWithApps|MTrackID| Title| Artist|Length|
+——-+——+——-+————–+——+——-+——-+—————+——+——————–+—–+———-+——+—–+——–+————–+——–+——————–+——————–+——+
| 0| 48| 453| 10/23/14 3:26| 0| 72132| 48| Lucas Pizano| 0|2723 Stony Beaver…|99256|11/02/2012| 1| 0| 3| 1| 453| Strange Magic|Electric Light Or…| 170|
| 1| 1081| 19|10/15/14 18:32| 1| 17307| 1081|Kenneth Rodgers| 0|74413 Heather Elm…|30301|05/14/2013| 0| 1| 1| 1| 19| Money Talks| AC/DC| 323|
| 2| 532| 36|12/10/14 15:33| 1| 66216| 532| Carlos Kirk| 0|14 Hidden Bear Ci…|90745|07/14/2013| 0| 2| 1| 1| 36| Big Ten Inch Record| Aerosmith| 204|
| 3| 2641| 822| 10/20/14 2:24| 1| 36690| 2641| Charlene Boyd| 0|5967 Stony Branch…| 4645|03/26/2013| 1| 1| 1| 0| 822| The Ripper| Judas Priest| 122|
| 4| 2251| 338| 11/18/14 7:16| 1| 61377| 2251| Mary Decker| 0|16355 Pretty Pand…|40580|06/23/2013| 1| 1| 0| 1| 338|Welcome To The Bo…| David & David| 269|
+——-+——+——-+————–+——+——-+——-+—————+——+——————–+—–+———-+——+—–+——–+————–+——–+——————–+——————–+——+
only showing top 5 rows

The costumers 0 listened 1617 out of 1715 we can see in the data table. This is really a great way to distinguish the answer.
In [17]:
from pyspark.sql.functions import split
togeth_way.createOrReplaceTempView(“togeth_way”)
split_datetime = split(togeth_way[‘DateTime’], ‘ ‘)
In [18]:
Date = ‘Date’
Time = ‘Time’
togeth_way = togeth_way.withColumn(Date + ‘L’, split_datetime.getItem(0))
togeth_way = togeth_way.withColumn(Time + ‘L’, split_datetime.getItem(1))
In [20]:
togeth_way.createOrReplaceTempView(“togeth_way”)
result_sheet = sqlContext.sql(“””SELECT CustId, Name, TrackId, Title, Length, Mobile, Gender, DateL, TimeL FROM togeth_way”””)
In [21]:
result_sheet.show(5)

+——+—————+——-+——————–+——+——+——+——–+—–+
|CustId| Name|TrackId| Title|Length|Mobile|Gender| DateL|TimeL|
+——+—————+——-+——————–+——+——+——+——–+—–+
| 48| Lucas Pizano| 453| Strange Magic| 170| 0| 0|10/23/14| 3:26|
| 1081|Kenneth Rodgers| 19| Money Talks| 323| 1| 0|10/15/14|18:32|
| 532| Carlos Kirk| 36| Big Ten Inch Record| 204| 1| 0|12/10/14|15:33|
| 2641| Charlene Boyd| 822| The Ripper| 122| 1| 0|10/20/14| 2:24|
| 2251| Mary Decker| 338|Welcome To The Bo…| 269| 1| 0|11/18/14| 7:16|
+——+—————+——-+——————–+——+——+——+——–+—–+
only showing top 5 rows

In [22]:
result_sheet = sqlContext.sql(“””SELECT CustId, Name, TrackId, Title, Length, Mobile, Gender, DateL, TimeL FROM togeth_way”””)
from pyspark.sql.functions import to_date, to_timestamp,to_utc_timestamp,hour
result_sheet=result_sheet.withColumn(“DateL”, to_date(“DateL”, “MM/dd/yy”))
result_sheet=result_sheet.withColumn(“HourL”, hour(“TimeL”))
result_sheet.createOrReplaceTempView(“result_sheet”)
In [23]:
s = “we return in a 5 line string dataframe”
result_sheet.show(5)

+——+—————+——-+——————–+——+——+——+———-+—–+—–+
|CustId| Name|TrackId| Title|Length|Mobile|Gender| DateL|TimeL|HourL|
+——+—————+——-+——————–+——+——+——+———-+—–+—–+
| 48| Lucas Pizano| 453| Strange Magic| 170| 0| 0|2014-10-23| 3:26| 3|
| 1081|Kenneth Rodgers| 19| Money Talks| 323| 1| 0|2014-10-15|18:32| 18|
| 532| Carlos Kirk| 36| Big Ten Inch Record| 204| 1| 0|2014-12-10|15:33| 15|
| 2641| Charlene Boyd| 822| The Ripper| 122| 1| 0|2014-10-20| 2:24| 2|
| 2251| Mary Decker| 338|Welcome To The Bo…| 269| 1| 0|2014-11-18| 7:16| 7|
+——+—————+——-+——————–+——+——+——+———-+—–+—–+
only showing top 5 rows

In [24]:
togeth_way = togeth_way.withColumn(‘DateL’, split_datetime.getItem(0))
togeth_way = togeth_way.withColumn(‘TimeL’, split_datetime.getItem(1))
s = “we return in a 5 line string dataframe”
togeth_way.show(5)

+——-+——+——-+————–+——+——-+——-+—————+——+——————–+—–+———-+——+—–+——–+————–+——–+——————–+——————–+——+——–+—–+
|EventID|CustID|TrackId| DateTime|Mobile|ZipCode|CCustID| Name|Gender| Address| zip| SignDate|Status|Level|Campaign|LinkedWithApps|MTrackID| Title| Artist|Length| DateL|TimeL|
+——-+——+——-+————–+——+——-+——-+—————+——+——————–+—–+———-+——+—–+——–+————–+——–+——————–+——————–+——+——–+—–+
| 0| 48| 453| 10/23/14 3:26| 0| 72132| 48| Lucas Pizano| 0|2723 Stony Beaver…|99256|11/02/2012| 1| 0| 3| 1| 453| Strange Magic|Electric Light Or…| 170|10/23/14| 3:26|
| 1| 1081| 19|10/15/14 18:32| 1| 17307| 1081|Kenneth Rodgers| 0|74413 Heather Elm…|30301|05/14/2013| 0| 1| 1| 1| 19| Money Talks| AC/DC| 323|10/15/14|18:32|
| 2| 532| 36|12/10/14 15:33| 1| 66216| 532| Carlos Kirk| 0|14 Hidden Bear Ci…|90745|07/14/2013| 0| 2| 1| 1| 36| Big Ten Inch Record| Aerosmith| 204|12/10/14|15:33|
| 3| 2641| 822| 10/20/14 2:24| 1| 36690| 2641| Charlene Boyd| 0|5967 Stony Branch…| 4645|03/26/2013| 1| 1| 1| 0| 822| The Ripper| Judas Priest| 122|10/20/14| 2:24|
| 4| 2251| 338| 11/18/14 7:16| 1| 61377| 2251| Mary Decker| 0|16355 Pretty Pand…|40580|06/23/2013| 1| 1| 0| 1| 338|Welcome To The Bo…| David & David| 269|11/18/14| 7:16|
+——-+——+——-+————–+——+——-+——-+—————+——+——————–+—–+———-+——+—–+——–+————–+——–+——————–+——————–+——+——–+—–+
only showing top 5 rows

It turns out that the customers are more likely to listen to the music at night before bedtime
As a result, The system can be boiled down to two concepts: Exploit and explore. When Spotify exploits, it’s using the information it knows about you, the user. It takes into account your music listening history, which songs you’ve skipped, what playlists you’ve made, your activity on the platform’s social features, and even your location. But when Spotify explores, it uses the information about the rest of the world, like playlists and artists similar to your taste in music but you haven’t heard yet, the popularity of other artists, and more.
The solution is to analyze the audio itself and train algorithms to learn to recognize different aspects of the music. Some experiments by Dieleman identified aspects of the song as distorted guitars, while others identified more abstract ideas such as genres.
In [42]:
from pyspark.sql import DataFrameWriter

result_writer = DataFrameWriter(result_sheet)
result_writer.saveAsTable(‘result_sheet’,format=’parquet’, mode=’overwrite’,path=’./files’)

3. Build Popularity-based Benchmark model and Machine Learning model (ALS)¶
In [26]:
#for week 41 to 50 through the training benchmark
wantsql = ‘select CustId, TrackID, sum(case when Length>0 then 1 else 0 end) as rating’
wantsql = wantsql + ‘ from result_sheet where weekofyear(DateL)<=51 and weekofyear(DateL)>=40 group by CustId, TrackID’
rit =spark.sql(wantsql)

rit.createOrReplaceTempView(“rating_im_train”)

We think the ideal solution is to improve the user interface to Observe user behavior. Implicit ratings include metrics of interest, such as whether a user reads An article, and if so, how much time a user spent reading it. Therefore, how much time is spent is a very important concept. In music, how much time users spend listening to this song is also a very important concept. The main motivation for using implicit ratings is that it eliminates the cost of assessors and assessors.
Each implicit score may contain values less than “value” Clear hierarchy, but appropriate cost-benefit tradeoffs for different types of implicit data Determined based on experience
Just as important as Spotify’s ability to exploit or explore is how the app explains its choices to users. Each label for shelves like “Jump back in” or “More of what you like” tells the user why those specific playlists are being recommended. Spotify has found that explanation is critical to users trusting explanations, according to the 2018 research paper on BaRT.
Three types of implicit data: read / not read, save / delete, and copy / not copy

3.1 Recommendation Model from Implicit Ratings: Train-Week 40-51, Test-Week 52 and 53¶

‘rating’ is Customer #i has listened to the Song #j in a total of n times over the period of t, which n is the raing
In [27]:
rit.show(5)

+——+——-+——+
|CustId|TrackID|rating|
+——+——-+——+
| 743| 39| 1|
| 46| 561| 1|
| 1762| 448| 1|
| 3| 89| 2|
| 2211| 158| 1|
+——+——-+——+
only showing top 5 rows

In [29]:
real_sql = “select CustId, TrackID, sum(case when Length>0 then 1 else 0 end) as rating”
sec = ” from result_sheet”
real_sql = real_sql + sec + ” where weekofyear(DateL)>51 group by CustId, TrackID”
rite=spark.sql(real_sql)
rite.createOrReplaceTempView(“rating_im_test”)
In [30]:
rite.show(5)

+——+——-+——+
|CustId|TrackID|rating|
+——+——-+——+
| 621| 1112| 1|
| 722| 1177| 1|
| 4844| 1011| 1|
| 2604| 800| 1|
| 1739| 923| 1|
+——+——-+——+
only showing top 5 rows

1. total 1060 customer has listened to song 0 in a total of 1 time in the test-set period, so rating is 1
2. CustId 621 has listened to song 3 in a total of 3 times in the test-set period, so rating is 1
3. CustId 722 has listened to song 2 in a total of 4 times in the test-set period, so rating is 1
4. CustId 4844 has listened to song 1 in a total of 3 times in the test-set period, so rating is 1
In [32]:
Furl = ‘select TrackID, count(*) as prediction ‘
se = ‘from rating_im_train ‘
ed = ‘group by TrackID’
md = spark.sql(Furl+se+ed)
Furl = Furl + se + ed
md.createOrReplaceTempView(“model”)

‘prediction’ is the number of unique customers that have listened to the same track
In [33]:
md.show(5)

+——-+———-+
|TrackID|prediction|
+——-+———-+
| 1088| 302|
| 148| 804|
| 833| 346|
| 463| 464|
| 471| 439|
+——-+———-+
only showing top 5 rows

In [39]:
st = ‘select t.*, m.prediction from rating_im_test t left join’
ed = ‘ model m on t.TrackId=m.TrackId’
prd = spark.sql(st + ed)
st = st + ed
prd.createOrReplaceTempView(“predictions”)
In [40]:
prd.show(5)

+——+——-+——+———-+
|CustId|TrackID|rating|prediction|
+——+——-+——+———-+
| 621| 1112| 1| 327|
| 722| 1177| 1| 285|
| 4844| 1011| 1| 324|
| 2604| 800| 1| 386|
| 1739| 923| 1| 339|
+——+——-+——+———-+
only showing top 5 rows

1. (Count (TrackID) as (CustId’s partition) -1) as the denominator => CustId # 621 listened to a total of 1112 unique tracks
2. Decrease 1, because we need 0% to 100%, so (numerator-1) / (denominator-1)
3. Spotify’s sweet spot for understanding whether a person likes a song or not seems to be 30 seconds.
4. (Ranking () Exceeded (divided by predicted desc in order of CustId) -1) As a numerator => Create ranking based on prediction (total number of unique customers by total number of songs)
5. rankui = 0% will mean that program i is predicted to be most desired by user u, and therefore precedes all other programs in the list. Rankui = 100%, on the other hand, indicates that program i is predicted to be the program that user u dislikes the most and is therefore at the end of the list.
In [42]:
st_c = “select CustId, TrackID, rating, ”
last = ” (rank() over (partition by CustId order by prediction desc)-1)*1.0/(count(TrackID) over (partition by CustId)-1) as p_rank from predictions”
EEPP_m =spark.sql(st_c+last)
st_c = st_c + last
EEPP_m.createOrReplaceTempView(“EPR_evaluation”)
In [43]:
EEPP_m.show(5)

+——+——-+——+——————–+
|CustId|TrackID|rating| p_rank|
+——+——-+——+——————–+
| 148| 2| 1| 0E-22|
| 148| 6| 1|0.021739130434782…|
| 148| 14| 1|0.043478260869565…|
| 148| 15| 1|0.065217391304347…|
| 148| 54| 1|0.086956521739130…|
+——+——-+——+——————–+
only showing top 5 rows

In [44]:
ans1 = “select sum(p_rank*rating)/sum(rating) as p_EPR from EPR_evaluation”
r_ans = ans1
ans1 += “;related to sum + 1”
result = spark.sql(r_ans)
In [45]:
result.show

+——–+
| p_EPR|
+——–+
|0.494389|
+——–+

if erp >= 0.5 then means random algorithm is better than erp algorithm
In [48]:
from pyspark.ml.evaluation import Evaluator

class we_need_for_evaluate(Evaluator):
def _evaluate(self, predictions):
prd.createOrReplaceTempView(“predictions”)
st = “select sum(p_rank*rating)/sum(rating) as p_EPR”
st1 = ” from (select CustId, TrackID, rating, ”
st2 = ” (rank() over (partition by CustId order by prediction desc)-1)*1.0/(count(TrackID) over (partition by CustId)-1) as p_rank from predictions)”
rt = spark.sql(st+st1+st2).collect()[0][0]
sum = 0
sum = (0 + sum)/2
return float(rt)
In [100]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
r_inc = 5
dda = 30
ddm = 5
ddr = 50
epr_evaluator = we_need_for_evaluate()
atlas_std= ALS(alpha=dda, maxIter=ddm, rank=ddr, regParam=0.1, userCol=”CustId”, itemCol=”TrackID”, ratingCol=”rating”, coldStartStrategy=”drop”, implicitPrefs=True, nonnegative=False)
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
model_atls = atlas_std.fit(rit)
read_answer = “select a from a real answer”
pred_atlas = model_atls.transform(rite)
watch_out = “A existing results”
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
epr_im = epr_evaluator.evaluate(pred_atlas)

print(“A real ranking for implicit rates ” + str(epr_im))

A real ranking for implicit rates 0.494389
In [49]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

epr_evaluator = we_need_for_evaluate()

atlas_std= ALS(alpha=200, maxIter=7, rank=50, regParam=0.08, userCol=”CustId”, itemCol=”TrackID”, ratingCol=”rating”, coldStartStrategy=”drop”,implicitPrefs=True, nonnegative=False)
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
epr_evaluator = we_need_for_evaluate()
ping_value = ‘select from predictions’
model_atls = atlas_std.fit(rit)
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
epr_evaluator = we_need_for_evaluate()
epr_evaluator = we_need_for_evaluate()

pred_atlas = model_atls.transform(rite)
# compute a real ranking for this answer
epr_im = epr_evaluator.evaluate(pred_atlas)
print(“A real ranking for implicit rating {}”.format(str(epr_im)))

A real ranking for implicit rating 0.494389
In [51]:
rec_fin = model_atls.recommendForAllUsers(5)
r_inc = 5
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
rec_fin.show(r_inc)

+——+——————–+
|CustId| recommendations|
+——+——————–+
| 1580|[[1667, 1.5378112…|
| 4900|[[702, 1.6951567]…|
| 471|[[1260, 1.190138]…|
| 1591|[[1325, 1.5576763…|
| 4101|[[819, 1.7149765]…|
| 1342|[[1625, 1.501605]…|
+——+——————–+
only showing top 6 rows

In [52]:
so_find = model_atls.recommendForAllItems(5)
so_find.show(5)

+——-+——————–+
|TrackID| recommendations|
+——-+——————–+
| 1580|[[3193, 1.8355082…|
| 471|[[1954, 1.2574292…|
| 1591|[[4208, 1.7878283…|
| 1342|[[3355, 1.7911422…|
| 463|[[3218, 1.3341033…|
+——-+——————–+
only showing top 5 rows

3.2 After cross the dataset randomly we generate a recommendation model to recommend spotiy music¶
In [64]:
def pop_render(model, test):
model.createOrReplaceTempView(“model”)
r_inc = 5
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
test.createOrReplaceTempView(“test”)
st = ‘select t.*, m.prediction from rating_im_test t left join’
ste = ‘ model m on t.TrackId=m.TrackId’
pd = spark.sql(st + ste)
return pd

def hit_rat(test):
test.createOrReplaceTempView(“test”)
r_inc = 5
st = “select sum(p_rank*rating)/sum(rating) as Best_EPR ”
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
eed = “””
from (
select CustId, TrackID, rating,
(rank() over (partition by CustId order by rating desc)-1)*1.0/(count(TrackID) over (partition by CustId)-1) as p_rank
from test
)
“””
ff_r = spark.sql(st + eed).collect()[0][0]
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
return float(ff_r)

def transfer_in(ratings_mat):
r_inc = 0
ratings_mat.createOrReplaceTempView(“ratings_mat”)
st = ‘select TrackID, count(*) as prediction ‘
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
eed = “””
from rating_im_train
group by TrackID
“””
md = spark.sql(st+eed)
return md

In [65]:
# using training for 41 – 50
# we have to anlysis the answer
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
st = “select CustId, TrackID, sum(case when Length>0 then 1 else 0 end) as rating ”
ed1 = “from result_sheet ”
ed2 = “group by CustId, TrackID”

r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
rim=spark.sql(st+ed1+ed2)

(rating_im_tr, rating_im_te) = rim.randomSplit([0.8, 0.2],seed=50)
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
rim.createOrReplaceTempView(“rating_im”)
rating_im_tr.createOrReplaceTempView(“rating_im_tr”)
f = “select the result from model”
f = “select the result from model”
rating_im_te.createOrReplaceTempView(“rating_im_te”)
In [69]:
r_inc = 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
pp_m =transfer_in(rating_im_tr)
r_inc = r_inc + 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
epv = we_need_for_evaluate()
pre_pop=pop_render(pp_m, rating_im_te)
epr_pop = epv.evaluate(pre_pop)
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
print(“Normal ranking for popularity = ” + str(epr_pop))

Normal ranking for popularity = 0.494389
In [70]:
dda = 30
ddm = 5
ddr = 50
atlas_std= ALS(alpha=dda, maxIter=ddm, rank=ddr, regParam=0.1, userCol=”CustId”, itemCol=”TrackID”, ratingCol=”rating”, coldStartStrategy=”drop”, implicitPrefs=True, nonnegative=False)
r_inc = 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
p = 5
epr_evaluator = we_need_for_evaluate()
model_atls_r = atlas_std.fit(rating_im_tr)
r_inc = 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
rating_agian = p + 10
pred_atlas_r = model_atls.transform(rating_im_te)
s = “select r from real one”
epr_im_r = epr_evaluator.evaluate(pred_atlas_r)
print(“Expected random split pertinent ” + str(epr_im_r))

Expected random split pertinent 0.494389
In [72]:
epv = we_need_for_evaluate()
atlas_std= ALS(alpha=100, maxIter=7, rank=50, regParam=0.08, userCol=”CustId”, itemCol=”TrackID”, ratingCol=”rating”, coldStartStrategy=”drop”, implicitPrefs=True, nonnegative=False)
# read a line from the new one

model_atls_r = atlas_std.fit(rating_im_tr)
r_inc = 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
pred_atlas_r = model_atls.transform(rating_im_te)
r_inc = 1
for j in range(1,10):
r_inc = r_inc + 1
for k in range(1,10):
r_inc = r_inc – 1
between = 15
between = r_inc + between
epr_im_r = epr_evaluator.evaluate(pred_atlas_r)
print(“implicit & random recommendation ” + str(epr_im_r))

implicit & random recommendation 0.494389

3.3 Using Explicit Rating to recommendate data¶
In [95]:
s1 = “select CustId, TrackID, ”
s2 = ” case when sum(case when Length>=0 then 1 else 0 end)<2 then 0″ s4 = ‘ when sum(case when Length>=0 then 1 else 0 end)<4 then 1’ s3 = ‘ when sum(case when Length>=0 then 1 else 0 end)<7 then 2’ s5 = “”” else 3 end as rating from result_sheet “”” real = ‘select from 7 answers’ s6 = ‘group by CustId, TrackID’ rex=spark.sql(s1+s2+s4+s3+s5+s6) ## random split the data r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 (rating_ex_train, rating_ex_test) = rating_ex.randomSplit([0.8, 0.2],seed=20) rating_ex.createOrReplaceTempView(“rating_ex”) rex.show(10) rating_ex_train.createOrReplaceTempView(“rating_ex_train”) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 rating_ex_test.createOrReplaceTempView(“rating_ex_test”) +——+——-+——+ |CustId|TrackID|rating| +——+——-+——+ | 743| 39| 1| | 46| 561| 0| | 1762| 448| 0| | 3| 89| 1| | 2211| 158| 0| | 3600| 69| 0| | 18| 68| 2| | 621| 1112| 0| | 1492| 282| 0| | 4316| 1424| 0| +——+——-+——+ only showing top 10 rows In [79]: epv = we_need_for_evaluate() mpp =transfer_in(rating_ex_train) ppop =pop_render(mpp, rating_ex_test) epr_pop = epv.evaluate(ppop) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 best_epr = hit_rat(rating_ex_test) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 print(“the pertentage of pop recomend:” + str(epr_pop)) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 print(“TBP ranking the answer is:” + str(best_epr)) the pertentage of pop recomend:0.494389 TBP ranking the answer is:0.024781711639237 It estimates that 60% of the time listeners are in a “closed” mindset when they are on the app; they know what they want to listen to and they just need to find it. The remaining 40% of time is spent in an “open” mindset. In this mindset, users put in less effort, they scroll less, they skip tracks more, and they click less on the artist for further information. Essentially, they are receptive to new ideas, but in a passive yet impatient state. In [88]: model_ex = ALS(rank=50, maxIter=7, regParam=0.06, userCol=”CustId”, itemCol=”TrackID”, ratingCol=”rating”, coldStartStrategy=”drop”, implicitPrefs=False, nonnegative=False).fit(rating_ex_train) ppex = model_ex.transform(rating_ex_test) epv = we_need_for_evaluate() r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 rmsev = RegressionEvaluator(metricName=”rmse”, labelCol=”rating”, predictionCol=”prediction”) d_rmse = rmsev.evaluate(ppex) d_epr = epv.evaluate(ppex) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 print (“the epr result is: ” + str(d_epr)) s = “select form a using c as a result” print(“the rmse answer result s following ” + str(d_rmse)) the epr result is: 0.494389 the rmse answer result s following 0.32803877202393206 In [90]: model = ALS(rank=100, maxIter=7, regParam=0.1, userCol=”CustId”, itemCol=”TrackID”, ratingCol=”rating”, coldStartStrategy=”drop”, implicitPrefs=False, nonnegative=False).fit(rating_ex_train) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 pdr = model.transform(rating_ex_test) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 epr_evaluator = we_need_for_evaluate() r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 f = “real one is under compose” rmev = RegressionEvaluator(metricName=”rmse”, labelCol=”rating”, predictionCol=”prediction”) real_r = rmev.evaluate(pdr) y = ” computing rmse using rmse answer” real_e = epr_evaluator.evaluate(pdr) print (“Expected result is” + str(pdr)) print(“RMSE reulst is” + str(real_r)) Expected result isDataFrame[CustId: int, TrackID: int, rating: int, prediction: float] RMSE reulst is0.3419420291783514 In [89]: r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit rranswer= ALS(userCol=”CustId”, itemCol=”TrackID”, ratingCol=”rating”, implicitPrefs=False, coldStartStrategy=”drop”, maxIter=1, rank=5) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 qutoe = “select from prediction and test for 7 years” rr_grid=ParamGridBuilder().addGrid(rranswer.regParam, [0.03,0.06]).build() r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 tt_rmse = RegressionEvaluator(metricName=”rmse”, labelCol=”rating”, predictionCol=”prediction”) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 real_training = TrainValidationSplit(estimator=rranswer, estimatorParamMaps=rr_grid, evaluator=tt_rmse,trainRatio=0.8,seed=100).fit(rating_ex_train) predictions = real_training.transform(rating_ex_test) r_inc = 1 for j in range(1,10): r_inc = r_inc + 1 for k in range(1,10): r_inc = r_inc – 1 best_model= real_training.bestModel print (“the real answer is : ” + str(best_model.rank)) print (“the best iteration in tvs:” + str(best_model._java_obj.parent().getMaxIter())) the real answer is : 5 the best iteration in tvs:1 The Spotify example provides an illuminating case in the benefits of blending different models to deliver diversity, as well as relevance. This is all shot through with a psychological and cultural understanding of how we listen to and share music. Spotify aims to take the nostalgic appeal of the mix tape and upgrading the concept for the digital age. In [84]: r_inc = 5 for j in range(1,5): r_inc = r_inc + 1 for k in range(1,5): r_inc = r_inc – 1 u_rc = model_ex.recommendForAllUsers(r_inc) #we get a 5 answer in this time s_rc = model_ex.recommendForAllItems(r_inc) In [85]: u_rc.show(r_inc) +——+——————–+ |CustId| recommendations| +——+——————–+ | 1580|[[0, 0.854263], […| | 4900|[[1, 0.02969639],…| | 471|[[0, 0.8252722], …| | 1591|[[0, 0.52127606],…| | 4101|[[0, 0.6293502], …| +——+——————–+ only showing top 5 rows In [86]: s_rc.show(r_inc) +——-+——————–+ |TrackID| recommendations| +——-+——————–+ | 1580|[[0, 0.53161836],…| | 471|[[0, 0.78574634],…| | 1591|[[0, 0.5791585], …| | 1342|[[0, 0.6110388], …| | 463|[[0, 0.977968], […| +——-+——————–+ only showing top 5 rows

Leave a Reply

Your email address will not be published. Required fields are marked *