Reddit Scraper Tool

The Reddit Scraper Tool is a custom tool that lets the Newsletter AI Agent gather Reddit discussions and community insights on a specified topic through an Apify actor.

Overview

The Reddit Scraper Tool is primarily used by the Researcher Agent to gather community discussions about the specified topic. It provides a flexible interface for searching Reddit and extracting structured data from posts and comments.

Implementation

The Reddit Scraper Tool is implemented as a CrewAI BaseTool that interacts with an Apify Reddit scraper actor. Here’s the implementation:

from crewai.tools import BaseTool
from pydantic import BaseModel, Field, ConfigDict
from typing import List, Optional, Literal
from apify import Actor
from src.tools.base import RunApifyActor

class RedditScraperInput(BaseModel):
    """Input schema for RedditScraper tool."""
    searches: List[str] = Field(
        description="Here you can provide a search query which will be used to search Reddit's topics."
    )

    startUrls: Optional[List[str]] = Field(
        description="If you already have URL(s) of page(s) you wish to scrape, you can set them here. If you want to use the search field below, remove all startUrls here.",
        default=None
    )
    
    skipComments: Optional[bool] = Field(
        default=False,
        description="This will skip scraping comments when going through posts"
    )
    
    # Additional parameters...

class RedditScraperTool(BaseTool):
    name: str = "Reddit Scraper"
    description: str = "Tool for scraping Reddit content with configurable parameters"
    args_schema: type[BaseModel] = RedditScraperInput
    actor: Actor = Field(description="Apify Actor instance")
    model_config = ConfigDict(arbitrary_types_allowed=True)
    
    def _run(
        self,
        searches: List[str],
        startUrls: Optional[List[str]] = None,
        skipComments: Optional[bool] = False,
        # Additional parameters...
    ) -> List[dict]:
        run_inputs = {}
        
        if searches:
            run_inputs["searches"] = searches
        if startUrls:
            run_inputs["startUrls"] = startUrls
        if skipComments:
            run_inputs["skipComments"] = skipComments
        # Set additional parameters...
        
        run_actor = RunApifyActor(self.actor)
        dataset = run_actor._run("reddit-scraper-actor-name", run_inputs)
        return dataset

Parameters

The Reddit Scraper Tool accepts the following parameters:

| Parameter | Type | Description | Default |
|---|---|---|---|
| searches | List[str] | Search queries for Reddit topics | Required |
| startUrls | List[str] | Direct URLs to Reddit pages to scrape | None |
| skipComments | bool | Skip scraping comments when processing posts | False |
| skipUserPosts | bool | Skip scraping user posts when processing user activity | False |
| skipCommunity | bool | Skip scraping community info but still get community posts | False |
| searchPosts | bool | Search for posts with the provided search | True |
| searchComments | bool | Search for comments with the provided search | False |
| searchCommunities | bool | Search for communities with the provided search | False |
| searchUsers | bool | Search for users with the provided search | False |
| sort | str | How to sort the results (e.g., "new", "top", "hot") | "new" |
| time | str | Time filter for results | None |
| includeNSFW | bool | Include NSFW content in results | True |
| maxPostCount | int | Maximum number of posts to retrieve | 20 |
| maxComments | int | Maximum number of comments to retrieve per post | 20 |
| maxCommunitiesCount | int | Maximum number of communities to retrieve | 2 |
| maxUserCount | int | Maximum number of users to retrieve | 2 |
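
The parameters and defaults listed above can be collected into a single object before they are handed to the actor. The following is a minimal sketch, not the tool's actual code: `RedditRunInput` and `to_run_input` are hypothetical names, and it uses a standard-library dataclass rather than the Pydantic schema shown in the implementation. It drops only values that are `None`, so an explicit `False` for a parameter whose default is `True` (such as `searchPosts` or `includeNSFW`) is still included in the run input.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RedditRunInput:
    """Hypothetical helper mirroring the documented parameters and defaults."""

    searches: List[str]
    startUrls: Optional[List[str]] = None
    skipComments: bool = False
    skipUserPosts: bool = False
    skipCommunity: bool = False
    searchPosts: bool = True
    searchComments: bool = False
    searchCommunities: bool = False
    searchUsers: bool = False
    sort: str = "new"
    time: Optional[str] = None
    includeNSFW: bool = True
    maxPostCount: int = 20
    maxComments: int = 20
    maxCommunitiesCount: int = 2
    maxUserCount: int = 2

    def to_run_input(self) -> dict:
        # Drop only unset (None) values. Explicit False values are kept,
        # so searchPosts=False or includeNSFW=False is not silently lost.
        return {key: value for key, value in asdict(self).items() if value is not None}


run_input = RedditRunInput(searches=["python"], searchPosts=False).to_run_input()
print(run_input)
```

Filtering on `is not None` instead of truthiness is the design choice here: a check such as `if searchPosts:` would omit an explicit `False` and leave the decision to whatever default the actor applies.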

Usage

The Researcher Agent initializes the tool with an Apify actor and passes the topic as the search query:

# Initialize the tool
reddit_tool = RedditScraperTool(actor=actor)

# Use the tool
reddit_results = reddit_tool._run(
    searches=[topic],
    searchPosts=True,
    searchComments=False,
    sort="relevance",
    maxPostCount=10
)

Return Value

The tool returns a list of Reddit posts and comments, where each item is a dictionary containing information about a post or comment, including:

  • title: The title of the post (for posts only)
  • text: The text content of the post or comment
  • url: The URL of the post or comment
  • author: The username of the author
  • score: The score (upvotes - downvotes) of the post or comment
  • created: The creation date of the post or comment
  • Additional metadata about the post or comment
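
As an illustration of working with this structure, the sketch below picks the highest-scoring posts from the returned list. It assumes only the field names listed above, and that comments can be told apart from posts by the absence of `title`. `top_posts` is a hypothetical helper, and the sample items are made-up placeholders rather than real Reddit data.

```python
from typing import Any, Dict, List


def top_posts(
    items: List[Dict[str, Any]], min_score: int = 0, limit: int = 5
) -> List[Dict[str, Any]]:
    """Return up to `limit` posts with score >= min_score, highest score first."""
    # Per the field list, only posts carry a title; comments do not.
    posts = [item for item in items if item.get("title")]
    # Treat a missing or null score as 0.
    scored = [post for post in posts if (post.get("score") or 0) >= min_score]
    return sorted(scored, key=lambda post: post.get("score") or 0, reverse=True)[:limit]


# Made-up placeholder items, for illustration only (not real Reddit data).
sample = [
    {"title": "Post A", "text": "...", "url": "https://example.com/a", "author": "user_a", "score": 12},
    {"text": "A comment", "url": "https://example.com/c", "author": "user_c", "score": 40},
    {"title": "Post B", "text": "...", "url": "https://example.com/b", "author": "user_b", "score": 30},
]

for post in top_posts(sample, min_score=10):
    print(post["score"], post["title"])
```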

Apify Integration

The tool uses an Apify Reddit scraper actor, which provides several advantages:

  1. Scalability: The actor can handle large numbers of Reddit searches efficiently
  2. Reliability: The actor is designed to handle rate limiting and other issues that can arise when scraping Reddit
  3. Structured Data: The actor returns Reddit posts and comments in a structured format that is easy to process

Configuration

To use the Reddit Scraper Tool, you need to set the following environment variable:

APIFY_API_KEY=your_apify_api_key_here
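
A missing key is easier to diagnose if it is checked once at startup rather than surfacing later as a failed actor run. The following is a minimal sketch of such a check. `get_apify_api_key` is a hypothetical helper, not part of the project, and how the key is then passed to the Apify actor is not shown here.

```python
import os


def get_apify_api_key() -> str:
    """Read APIFY_API_KEY from the environment, failing early if it is unset."""
    key = os.environ.get("APIFY_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "APIFY_API_KEY is not set; export it before running the agent."
        )
    return key
```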

Next Steps