Building a Reddit Bot that Detects Trash - Python Reddit API Wrapper (PRAW) tutorial p.4




Over the last year or so, I have seen a sharp rise in affiliate marketing spam for Udemy courses on Reddit. The main subreddits that I frequent are /r/python and /r/learnpython. At least on those subreddits, these spam links would actually get upvoted, because the title looked appealing and sounded good. It was clear that many people didn't bother looking into what was actually happening.

Many times, the spam would be posted via a submission on Medium, so people thought it was some Medium writeup on the topic, when really it was just a course description and a Udemy affiliate link. After getting annoyed enough, I started posting on these threads to point out that they were spam. Interestingly enough, this seemed to make a huge impact, and people started paying attention to it. About 6 months ago, I even considered writing a bot to automatically identify posts like these and reply to them, to make it more obvious to people that this was just spam for profit, but I always had things that I was more interested in working on.

...Until I saw my own course being pirated, put up for sale on Udemy, and spam-marketed to Reddit. Oh, you know what, I've got a few minutes.

I had always figured this spam was put forth by course creators themselves, but it's become clear to me that these are massive spam rings, doing referral/affiliate spam for profit. The accounts on Reddit appear to be created, and then sit dormant for ~2 months before going active, as if this will somehow make it appear as though they're legit.

My first step is to manually search Google for the Udemy course name, since I am sure this scammer is also a spammer. I search: Mastery Python 3 Basics Tutorial Series

Immediately, I find the Udemy course of course, a Medium post linking to the Udemy course, and a bunch of spammy discount sites. Okay, hmm, no surprise. Hey, I've seen all that Reddit spam with Udemy courses, I wonder if it's there. Again, another Google search, this time for site:reddit.com Mastery Python 3 Basics Tutorial Series

Jackpot!

Okay, let's click on one of those Reddit posts:

Yep, there it is. Hey, I wonder if we just clicked on the name...

Wow, that's a lot of courses.

Okay, now what? Well, we probably do not want to repeat this process via Python, due to the Google search. We could use something like the google-search package, but, having experience with trying to maintain a program that uses Google search, I would like to avoid this at all costs.

You know what, I bet we could do all of this via the Python Reddit API Wrapper. We can use PRAW to search Reddit for phrases like "Udemy" or "Udemy Free," for example. From here, I am going to wager almost all of these are going to be spam posts, given the current state of Udemy spam. That said, some of these might actually be legitimate, and we'd rather not be wrong. How might we identify spam posts and spam authors?

There are obviously *many* ways to do this, but you can usually tell immediately by just looking at the user's profile. My plan is to just find Udemy related threads, then visit the author's profile, and see how much Udemy junk is in there. If more than, say, half of their posts are about Udemy courses, then we're going to call that a spammer, or at least notify people of the situation by posting on the suspicious threads from that author.

Our first step? Well, we need a Reddit account. I've made a new one, calling it Spam_Detector_Bot, a fitting name. Next, we need to go into preferences, then the apps tab. Now, we'll click to create a new app. From here, pick a fitting name, such as Idiot Detector, I don't know, just spit-balling here. Okay, next, we need to pick script, fill in a description, and then add an about URL and a redirect URL. Feel free to use your GitHub, some personal site, or you can use https://news.r6siege.cn. When you're done, create the app.

Now, I will create a new .py file, calling it praw_creds.py. Inside it:

client_id = ''
client_secret = ''
password = ''
user_agent = ''
username = ''

From the API information, fill this out, save, and close it.
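
If you'd rather not keep secrets in a .py file that might end up in version control, one option is to read the same five values from environment variables. This is only a sketch, not part of the script in this tutorial: the REDDIT_* variable names and the load_creds helper are my own assumptions.

```python
import os

# Hypothetical naming scheme: REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, etc.
CRED_KEYS = ["client_id", "client_secret", "password", "user_agent", "username"]

def load_creds(environ=os.environ):
    """Return a dict of the five credential values read from REDDIT_* variables."""
    creds = {}
    for key in CRED_KEYS:
        env_name = "REDDIT_" + key.upper()
        value = environ.get(env_name, "")
        if not value:
            # Fail early with the name of the variable that is missing.
            raise KeyError("Missing environment variable: " + env_name)
        creds[key] = value
    return creds
```

With those variables set, the Reddit instance could then be built with praw.Reddit(**load_creds()), since the dictionary keys match the keyword arguments used in this tutorial.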

Alright, next, I'll create a new python file, calling it to_catch_a_spammer.py. To begin, let's work on the functionality to use search. If you haven't yet, do a pip install praw.

to_catch_a_spammer.py
import praw
from praw_creds import client_id, client_secret, password, user_agent, username

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret, password=password,
                     user_agent=user_agent, username=username)

What we've done here so far is just import praw and our API credentials, and set up the Reddit instance. Now for searching:

def find_spam_by_name(search_query):
    authors = []
    for submission in reddit.subreddit("all").search(search_query, sort="new", limit=11):
        print(submission.title, submission.author, submission.url)
        if submission.author not in authors:
            authors.append(submission.author)
    return authors

This will search for a phrase, then return the authors we find from the newest submissions of that phrase. Let's test it:

if __name__ == '__main__':
    authors = find_spam_by_name("Free Udemy")
    for author in authors:
        print(str(author))
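
One edge case worth knowing about: PRAW gives you None for submission.author when the account behind a post has been deleted. Here is a minimal sketch of the same de-duplication as a standalone helper, so it can be tried without hitting Reddit. The name unique_author_names is hypothetical, and SimpleNamespace objects stand in for real submissions.

```python
from types import SimpleNamespace

def unique_author_names(submissions):
    """Collect author names in first-seen order, skipping missing authors."""
    names = []
    for submission in submissions:
        if submission.author is None:
            continue  # deleted account, nothing to look up later
        name = str(submission.author)
        if name not in names:
            names.append(name)
    return names

# Stand-in objects carrying only the attribute the helper reads.
fake_results = [
    SimpleNamespace(author="mizaksad"),
    SimpleNamespace(author=None),
    SimpleNamespace(author="plakucisf"),
    SimpleNamespace(author="mizaksad"),
]
print(unique_author_names(fake_results))  # ['mizaksad', 'plakucisf']
```

Storing plain strings also means later code can use the names directly as dictionary keys.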

Full code up to now:

import praw
from praw_creds import client_id, client_secret, password, user_agent, username


reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret, password=password,
                     user_agent=user_agent, username=username)

def find_spam_by_name(search_query):
    authors = []
    for submission in reddit.subreddit("all").search(search_query, sort="new", limit=11):
        print(submission.title, submission.author, submission.url)
        if submission.author not in authors:
            authors.append(submission.author)
    return authors

if __name__ == '__main__':
    authors = find_spam_by_name("Free Udemy")
    for author in authors:
        print(str(author))

Running this gives us:

FREE Udemy course: MVVM Design Pattern Using Swift in iOS mizaksad https://twitter.com/gamingsaledeals/status/953502567945916418
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakucisf https://twitter.com/gamingsaledeals/status/953502892778024960
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakuciss https://twitter.com/gamingsaledeals/status/953502892778024960
Free Udemy Course - Intro To The Steemit Social Media Blogging Platform — Steemit flaplanet https://steemit.com/steemit/@johnelder-org/free-udemy-course-intro-to-the-steemit-social-media-blogging-platform
Udemy Online Courses for free dfslol https://www.dealnews.com/lw/landing.html?uri=%2FUdemy-Online-Courses-for-free%2F2177081.html%3Firef%3Drss-dealnews-todays-edition
Get 260 Udemy Paid Course Free With Deal5star Deal5star https://www.youtube.com/watch?v=J8TZTj3hbt4&feature=youtu.be
FREE Udemy Course: Podcasting: How To Make Your Own Podcast yellowsnow3000 https://twitter.com/somethingometh/status/951047771787874304
FREE Udemy course: Business Goal-Setting Masterclass KellyfromLeedsUK https://www.reddit.com/r/business/comments/7pfeov/free_udemy_course_business_goalsetting_masterclass/?utm_source=ifttt
9 Udemy Courses taught by Chris M Nemo will be FREE! JANUARY,10 , FROM 6 am GMT to 4 PM GMT Deal5star https://deal5star.com/9-udemy-courses-taught-chris-m-nemo-will-free-january10-6-gmt-4-pm-gmt/
FREE Udemy course: Business Goal-Setting Masterclass dealbawt https://www.reddit.com/r/business/comments/7pfeov/free_udemy_course_business_goalsetting_masterclass/?utm_source=ifttt
My Udemy Course Launch! (Free Coupons) BeyondUsGames https://www.reddit.com/r/gamemaker/comments/7p4rsw/my_udemy_course_launch_free_coupons/
mizaksad
plakucisf
plakuciss
flaplanet
dfslol
Deal5star
yellowsnow3000
KellyfromLeedsUK
dealbawt
BeyondUsGames

Great, we've got some authors. Now, just because someone posts something on Reddit about Udemy or some free course, it doesn't mean they're a spammer. We need to look a bit deeper into these accounts.

Let's change our main block now, starting with:

if __name__ == "__main__":
    while True:
        current_search_query = random.choice(["udemy"])
        spam_content = []
        trashy_users = {}
        smelly_authors = find_spam_by_name(current_search_query)

In the interest of possibly adding new spammy sources/phrases/words, I will have the current_search_query be a random choice of varying words. For now, my main focus is on Udemy spam, so that's the only choice, but let's make this script grow-able in the future! Since we're using random here, let's import it:

import random

Now, with these smelly authors, we need to see how much of their content is trash (spam):

        for author in smelly_authors:
            user_trashy_urls = []
            sub_count = 0
            dirty_count = 0

We'll save some starting information, an empty list to populate with submissions, a submission counter and a dirty counter for each submission from each "smelly" author that we're looking into. At the top of our script, let's add some common words that are used with spam:

common_spammy_words = ['udemy','course','save','coupon','free','discount']

Then, back inside the loop over smelly_authors:

            try:
                for sub in reddit.redditor(str(author)).submissions.new():
                    submit_links_to = sub.url
                    submit_id = sub.id 
                    submit_subreddit = sub.subreddit
                    submit_title = sub.title
                    dirty = False
                    for w in common_spammy_words:
                        if w in submit_title.lower():
                            dirty = True
                            # compare the same 3-item record we store, so a title
                            # matching several words is only logged once
                            junk = [submit_id,submit_title,str(author)]
                            if junk not in user_trashy_urls:
                                user_trashy_urls.append(junk)

                    if dirty:
                        dirty_count+=1
                    sub_count+=1                                

            except Exception as e:
                print(str(e))  

Above, we begin to iterate through the potentially trashy author's submissions, looking for common_spammy_words in the titles. If we do find them, let's log this, and continue through the author's submissions. Once we've gone through them, let's generate a trashy_score:

            try:
                for sub in reddit.redditor(str(author)).submissions.new():
                    submit_links_to = sub.url
                    submit_id = sub.id 
                    submit_subreddit = sub.subreddit
                    submit_title = sub.title
                    dirty = False
                    for w in common_spammy_words:
                        if w in submit_title.lower():
                            dirty = True
                            # compare the same 3-item record we store, so a title
                            # matching several words is only logged once
                            junk = [submit_id,submit_title,str(author)]
                            if junk not in user_trashy_urls:
                                user_trashy_urls.append(junk)

                    if dirty:
                        dirty_count+=1
                    sub_count+=1

                try:
                    trashy_score = dirty_count/sub_count
                except ZeroDivisionError:
                    # author has no submissions to score
                    trashy_score = 0.0
                print("User {} trashy score is: {}".format(str(author), round(trashy_score,3)))

                if trashy_score >= 0.5:
                    trashy_users[str(author)] = [trashy_score,sub_count]

                    for trash in user_trashy_urls:
                        spam_content.append(trash)  

            except Exception as e:
                print(str(e))  
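
The scoring step can also be pulled out into a small pure function, which makes it easy to experiment with different word lists without calling the API. This is a sketch under my own naming (is_dirty and trashy_score are hypothetical helpers, not part of the script); it applies the same substring test and the same dirty/total ratio, and the sample titles are made up apart from two borrowed from the search output.

```python
common_spammy_words = ['udemy', 'course', 'save', 'coupon', 'free', 'discount']

def is_dirty(title, words=common_spammy_words):
    """True if any spammy word appears as a substring of the lowercased title."""
    lowered = title.lower()
    return any(w in lowered for w in words)

def trashy_score(titles, words=common_spammy_words):
    """Fraction of titles flagged as dirty; 0.0 when there are no titles."""
    if not titles:
        return 0.0
    dirty_count = sum(1 for t in titles if is_dirty(t, words))
    return dirty_count / len(titles)

titles = [
    "FREE Udemy course: Business Goal-Setting Masterclass",
    "My cat learned to open doors",
    "Select Courses $0 at Udemy",
    "Weekend hiking photos",
]
print(trashy_score(titles))  # 0.5, since 2 of the 4 titles are flagged
```

Keep in mind this is a substring test, so a title containing "freelance" counts as containing "free". That is one reason to treat the score as a hint rather than proof.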

Full code up to this point:

import praw
from praw_creds import client_id, client_secret, password, user_agent, username
import random

common_spammy_words = ['udemy','course','save','coupon','free','discount']

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret, password=password,
                     user_agent=user_agent, username=username)

def find_spam_by_name(search_query):
    authors = []
    for submission in reddit.subreddit("all").search(search_query, sort="new", limit=11):
        print(submission.title, submission.author, submission.url)
        if submission.author not in authors:
            authors.append(submission.author)
    return authors

 
if __name__ == "__main__":
    while True:
        current_search_query = random.choice(["udemy"])
        spam_content = []
        trashy_users = {}
        smelly_authors = find_spam_by_name(current_search_query)
        for author in smelly_authors:
            user_trashy_urls = []
            sub_count = 0
            dirty_count = 0
            try:
                for sub in reddit.redditor(str(author)).submissions.new():
                    submit_links_to = sub.url
                    submit_id = sub.id 
                    submit_subreddit = sub.subreddit
                    submit_title = sub.title
                    dirty = False
                    for w in common_spammy_words:
                        if w in submit_title.lower():
                            dirty = True
                            # compare the same 3-item record we store, so a title
                            # matching several words is only logged once
                            junk = [submit_id,submit_title,str(author)]
                            if junk not in user_trashy_urls:
                                user_trashy_urls.append(junk)

                    if dirty:
                        dirty_count+=1
                    sub_count+=1

                try:
                    trashy_score = dirty_count/sub_count
                except ZeroDivisionError:
                    # author has no submissions to score
                    trashy_score = 0.0
                print("User {} trashy score is: {}".format(str(author), round(trashy_score,3)))

                if trashy_score >= 0.5:
                    trashy_users[str(author)] = [trashy_score,sub_count]

                    for trash in user_trashy_urls:
                        spam_content.append(trash)  

            except Exception as e:
                print(str(e))         

Output from one loop of this:

Any Udemy course for $9.99 sepang-moto http://techshippers.com
Udemy : Get Your First SEO Client Using Freelance Sites onlinefreecourses http://offersallin1.com/coupons/udemy-get-your-first-seo-client-using-freelance-sites/
Select Courses $0 at Udemy dfslol https://bensbargains.com/bargain/select-courses-573148/#rss
Udemy best news courses | January 18, 2018 Carolin3 http://mailchi.mp/f56dd2542632/your-daily-best-udemy-courses-selection-1441441
Cisco CCNA 200-125 : Full Course For Networking Basics - Udemy khongbietmatkhau11 http://dlfree24h.com/ebooks/405687-cisco-ccna-200-125-full-course-for-networking-basics.html
FREE Udemy course: MVVM Design Pattern Using Swift in iOS mizaksad https://twitter.com/gamingsaledeals/status/953502567945916418
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakucisf https://twitter.com/gamingsaledeals/status/953502892778024960
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakuciss https://twitter.com/gamingsaledeals/status/953502892778024960
The Complete Steemit Cryptocurrency Course is 92% off at Udemy mizaksaf https://twitter.com/gamingsaledeals/status/953596050627088385
The Complete Steemit Cryptocurrency Course is 92% off at Udemy mizaksad https://twitter.com/gamingsaledeals/status/953596050627088385
Udemy course: Docker Mastery - The Complete Toolset From a Docker Captain is 94% off plakucisf https://www.reddit.com/r/docker/comments/7r0ztj/udemy_course_docker_mastery_the_complete_toolset/
User sepang-moto trashy score is: 0.45
User onlinefreecourses trashy score is: 0.927
User dfslol trashy score is: 0.26
User Carolin3 trashy score is: 1.0
received 404 HTTP response
User mizaksad trashy score is: 0.444
User plakucisf trashy score is: 0.658
User plakuciss trashy score is: 0.592
User mizaksaf trashy score is: 0.333

So we've found at least a few clear spammers, like onlinefreecourses, plakucisf, and plakuciss.
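
To make the cutoff concrete, here is the 0.5 threshold applied to the scores printed above. The dictionary is simply copied from that output; the variable names are mine.

```python
# Scores copied from the output above.
scores = {
    "sepang-moto": 0.45,
    "onlinefreecourses": 0.927,
    "dfslol": 0.26,
    "Carolin3": 1.0,
    "mizaksad": 0.444,
    "plakucisf": 0.658,
    "plakuciss": 0.592,
    "mizaksaf": 0.333,
}

THRESHOLD = 0.5  # same cutoff as the script

flagged = [user for user, score in scores.items() if score >= THRESHOLD]
print(flagged)  # ['onlinefreecourses', 'Carolin3', 'plakucisf', 'plakuciss']
```

Note that sepang-moto at 0.45 falls just under the line, so where you put the threshold matters.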

Okay, now what? Well, let's iterate through the spam content, and post some love on it!

Let's go ahead and import time:

import time

Then:

        for spam in spam_content:
            spam_id = spam[0]
            spam_user = spam[2]
            submission = reddit.submission(id=spam[0])
            created_time = submission.created_utc
            if time.time()-created_time <= 86400:
                link = "https://reddit.com"+submission.permalink

                message = """*Beep boop*

I am a bot that sniffs out spammers, and this smells like spam.

At least {}% out of the {} submissions from /u/{} appear to be for Udemy affiliate links. 

Don't let spam take over Reddit! Throw it out!

*Bee bop*""".format(round(trashy_users[spam_user][0]*100,2), trashy_users[spam_user][1], spam_user)

                try:
                    # "a+" creates posted_urls.txt on the first run; seek(0) so we can read it
                    with open("posted_urls.txt","a+") as f:
                        f.seek(0)
                        already_posted = f.read().split('\n')
                    if link not in already_posted:
                        print(message)
                        submission.reply(message)
                        print("We've posted to {} and now we need to sleep for 12 minutes".format(link))
                        with open("posted_urls.txt","a") as f:
                            f.write(link+'\n')
                        time.sleep(12*60)
                        break
                except Exception as e:
                    print(str(e))
                    time.sleep(12*60)
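
The bookkeeping in that block, meaning the 24-hour age check (86400 seconds) and the record of threads we've already replied to, can be sketched as standalone helpers. The names is_recent, load_posted, and record_posted are hypothetical, the example link is made up, and load_posted treats a missing file as "nothing posted yet".

```python
import os
import tempfile
import time

ONE_DAY = 86400  # seconds, same cutoff as the script

def is_recent(created_utc, now=None, max_age=ONE_DAY):
    """True if the submission is no older than max_age seconds."""
    if now is None:
        now = time.time()
    return now - created_utc <= max_age

def load_posted(path):
    """Return the set of links already replied to; empty if the file is missing."""
    try:
        with open(path, "r") as f:
            return set(line for line in f.read().split("\n") if line)
    except FileNotFoundError:
        return set()

def record_posted(path, link):
    """Append one link to the log, creating the file if needed."""
    with open(path, "a") as f:
        f.write(link + "\n")

# Try it against a throwaway file rather than the real posted_urls.txt.
path = os.path.join(tempfile.mkdtemp(), "posted_urls.txt")
link = "https://reddit.com/r/example/comments/abc123/some_thread/"  # made-up link
print(load_posted(path))                                    # set()
record_posted(path, link)
print(link in load_posted(path))                            # True
print(is_recent(created_utc=1000, now=1000 + ONE_DAY + 1))  # False
```

Passing now in explicitly is only there to make the age check easy to test; in the bot you would leave it out and let it default to the current time.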

Full code at this point:

import praw
from praw_creds import client_id, client_secret, password, user_agent, username
import random
import time

common_spammy_words = ['udemy','course','save','coupon','free','discount']

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret, password=password,
                     user_agent=user_agent, username=username)

def find_spam_by_name(search_query):
    authors = []
    for submission in reddit.subreddit("all").search(search_query, sort="new", limit=11):
        print(submission.title, submission.author, submission.url)
        if submission.author not in authors:
            authors.append(submission.author)
    return authors

 
if __name__ == "__main__":
    while True:
        current_search_query = random.choice(["udemy"])
        spam_content = []
        trashy_users = {}
        smelly_authors = find_spam_by_name(current_search_query)
        for author in smelly_authors:
            user_trashy_urls = []
            sub_count = 0
            dirty_count = 0
            try:
                for sub in reddit.redditor(str(author)).submissions.new():
                    submit_links_to = sub.url
                    submit_id = sub.id 
                    submit_subreddit = sub.subreddit
                    submit_title = sub.title
                    dirty = False
                    for w in common_spammy_words:
                        if w in submit_title.lower():
                            dirty = True
                            # compare the same 3-item record we store, so a title
                            # matching several words is only logged once
                            junk = [submit_id,submit_title,str(author)]
                            if junk not in user_trashy_urls:
                                user_trashy_urls.append(junk)

                    if dirty:
                        dirty_count+=1
                    sub_count+=1

                try:
                    trashy_score = dirty_count/sub_count
                except ZeroDivisionError:
                    # author has no submissions to score
                    trashy_score = 0.0
                print("User {} trashy score is: {}".format(str(author), round(trashy_score,3)))

                if trashy_score >= 0.5:
                    trashy_users[str(author)] = [trashy_score,sub_count]

                    for trash in user_trashy_urls:
                        spam_content.append(trash)  

            except Exception as e:
                print(str(e))

        for spam in spam_content:
            spam_id = spam[0]
            spam_user = spam[2]
            submission = reddit.submission(id=spam[0])
            created_time = submission.created_utc
            if time.time()-created_time <= 86400:
                link = "https://reddit.com"+submission.permalink

                message = """*Beep boop*

I am a bot that sniffs out spammers, and this smells like spam.

At least {}% out of the {} submissions from /u/{} appear to be for Udemy affiliate links. 

Don't let spam take over Reddit! Throw it out!

*Bee bop*""".format(round(trashy_users[spam_user][0]*100,2), trashy_users[spam_user][1], spam_user)

                try:
                    # "a+" creates posted_urls.txt on the first run; seek(0) so we can read it
                    with open("posted_urls.txt","a+") as f:
                        f.seek(0)
                        already_posted = f.read().split('\n')
                    if link not in already_posted:
                        print(message)
                        submission.reply(message)
                        print("We've posted to {} and now we need to sleep for 12 minutes".format(link))
                        with open("posted_urls.txt","a") as f:
                            f.write(link+'\n')
                        time.sleep(12*60)
                        break
                except Exception as e:
                    print(str(e))
                    time.sleep(12*60)

Running this, we get something like the following, ending with the bot's reply:

Any Udemy course for $9.99 sepang-moto http://techshippers.com
Udemy : Get Your First SEO Client Using Freelance Sites onlinefreecourses http://offersallin1.com/coupons/udemy-get-your-first-seo-client-using-freelance-sites/
Select Courses $0 at Udemy dfslol https://bensbargains.com/bargain/select-courses-573148/#rss
Udemy best news courses | January 18, 2018 Carolin3 http://mailchi.mp/f56dd2542632/your-daily-best-udemy-courses-selection-1441441
Cisco CCNA 200-125 : Full Course For Networking Basics - Udemy khongbietmatkhau11 http://dlfree24h.com/ebooks/405687-cisco-ccna-200-125-full-course-for-networking-basics.html
FREE Udemy course: MVVM Design Pattern Using Swift in iOS mizaksad https://twitter.com/gamingsaledeals/status/953502567945916418
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakucisf https://twitter.com/gamingsaledeals/status/953502892778024960
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakuciss https://twitter.com/gamingsaledeals/status/953502892778024960
The Complete Steemit Cryptocurrency Course is 92% off at Udemy mizaksaf https://twitter.com/gamingsaledeals/status/953596050627088385
The Complete Steemit Cryptocurrency Course is 92% off at Udemy mizaksad https://twitter.com/gamingsaledeals/status/953596050627088385
Udemy course: Docker Mastery - The Complete Toolset From a Docker Captain is 94% off plakucisf https://www.reddit.com/r/docker/comments/7r0ztj/udemy_course_docker_mastery_the_complete_toolset/
User sepang-moto trashy score is: 0.45
User onlinefreecourses trashy score is: 0.927
User dfslol trashy score is: 0.29
User Carolin3 trashy score is: 1.0
received 404 HTTP response
User mizaksad trashy score is: 0.444
User plakucisf trashy score is: 0.658
User plakuciss trashy score is: 0.592
User mizaksaf trashy score is: 0.333
*Beep boop*

I am a bot that sniffs out spammers, and this smells like spam.

At least 92.71% out of the 96 submissions from /u/onlinefreecourses appear to be for Udemy affiliate links. 

Don't let spam take over Reddit! Throw it out!

*Bee bop*
We've posted to https://reddit.com/r/udemyfreebies/comments/7r7nno/udemy_get_your_first_seo_client_using_freelance/ and now we need to sleep for 12 minutes

Speaking of which, the subreddit /r/udemyfreebies/ is basically *all* spam and trash. I am going to guess that most users on here know exactly what this content is, but we'll still post for now.

Want to contribute to this project? I've hosted it on GitHub: reddit_spam_detector_bot.
