AI-Powered News Aggregator for Bluesky

April 2nd, 2025  |  Published in Personal

Introduction

I recently published the code that runs my news-analysis Bluesky account, in part to provide as much transparency as possible into how the news items it posts are selected and summarized.

Technical Overview

The system’s architecture is built around a Python application that integrates multiple technologies to form a comprehensive news processing pipeline. Here’s a high-level overview of how it works:

1. News Collection: The system starts by gathering potential news articles from various sources
2. AI Evaluation: Google’s Gemini AI evaluates and selects the most newsworthy content
3. Content Processing: The selected article is fetched and analyzed for paywalls and duplicates
4. Summary Generation: Gemini AI creates a balanced, factual summary
5. Social Media Posting: The summary is posted to Bluesky with proper formatting and hashtags

The key technologies powering this system include:

Python: The core programming language tying everything together
Google’s Gemini AI: For intelligent article selection and summary generation
AT Protocol Client (atproto): For posting to Bluesky
Newspaper3k: For web scraping and article parsing
Selenium: For browser automation when needed to resolve URLs
NLTK: For natural language processing tasks
Pandas: For data handling and manipulation

The system follows a linear flow, with each component handling a specific part of the process:

News Sources → Selection AI → Content Fetcher → Paywall Detection → 
Similarity Checker → Summary Generator → Bluesky Poster

Each component is designed to be modular, allowing for easier maintenance and future enhancements.
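
To make that flow concrete, here is a minimal sketch of what the top-level orchestration might look like. The component names and method signatures are illustrative stand-ins rather than the actual classes in the repository.

def run_pipeline(collector, selector, fetcher, paywall, similarity, summarizer, poster):
    """Illustrative end-to-end run; each argument is one pipeline component."""
    candidates = collector.get_candidate_articles()          # 1. News Collection
    url, title = selector.pick_most_newsworthy(candidates)   # 2. AI Evaluation
    article = fetcher.fetch(url)                             # 3. Content Processing
    if paywall.is_paywalled(url, article):
        return  # skip paywalled content
    if similarity.is_duplicate(article):
        return  # skip stories that were already covered
    summary, hashtag = summarizer.summarize(article)          # 4. Summary Generation
    poster.post(summary, url, hashtag)                        # 5. Social Media Posting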

Key Features Deep Dive

AI-Powered Article Selection

At the heart of the system is the AI-powered article selection process. Rather than using simple metrics like view counts or recency, I leverage Google’s Gemini AI to evaluate the newsworthiness of potential articles.

The selection process begins by providing the AI with a list of candidate articles and recent posts. Here’s the prompt I use:

prompt = f"""Select the single most newsworthy article that:
1. Has significant public interest or impact
2. Represents meaningful developments rather than speculation
3. Avoids sensationalism and clickbait

Recent posts:
{recent_titles}

Candidates:
{candidate_titles}

Return ONLY the URL and Title in this format:
URL: [selected URL]
TITLE: [selected title]"""

This approach has several advantages:

1. Contextual understanding: The AI can understand the content and context of articles, not just keywords
2. Balance against recent posts: By including recently posted content, the AI avoids selecting similar stories
3. Quality prioritization: The specific criteria help filter out clickbait and promotional content

To ensure diversity, I also randomize the candidate list before evaluation, which prevents the system from getting stuck in content loops or echo chambers.
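
Below is a minimal sketch of how this selection step might be wired together. The build_selection_prompt helper (which would return the prompt shown above) and the dictionary keys are hypothetical stand-ins; the Gemini call itself is the standard generate_content method from the google.generativeai client.

import random

def select_article(model, candidates, recent_posts):
    """Shuffle candidates, ask Gemini to pick one, and parse the URL/TITLE reply."""
    random.shuffle(candidates)  # randomization prevents content loops
    recent_titles = "\n".join(p["title"] for p in recent_posts)
    candidate_titles = "\n".join(f"{c['title']} - {c['url']}" for c in candidates)

    prompt = build_selection_prompt(recent_titles, candidate_titles)  # the prompt shown above
    response = model.generate_content(prompt)

    url = title = None
    for line in response.text.splitlines():
        if line.startswith("URL:"):
            url = line.removeprefix("URL:").strip()
        elif line.startswith("TITLE:"):
            title = line.removeprefix("TITLE:").strip()
    return url, title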

Content Similarity Detection

One of the biggest challenges in automated news posting is avoiding duplicate or highly similar content. Nobody wants to see the same story multiple times in their feed, even if it comes from different sources.

I implemented a tiered approach to similarity detection that balances efficiency and accuracy:

Tier 1: Basic Keyword Matching

title_words = set(re.sub(r'[^\w\s]', '', article_title.lower()).split())
title_words = {w for w in title_words if len(w) > 3}  # Only keep meaningful words

# Check for basic title similarity first (cheaper than AI check)
for post in posts_to_check:
    if not post.title:
        continue
        
    post_title_words = set(re.sub(r'[^\w\s]', '', post.title.lower()).split())
    post_title_words = {w for w in post_title_words if len(w) > 3}
    
    # If more than 50% of important words match, likely similar content
    if title_words and post_title_words:
        word_overlap = len(title_words.intersection(post_title_words))
        similarity_ratio = word_overlap / min(len(title_words), len(post_title_words))
        
        if similarity_ratio > 0.5:
            logger.info(f"Title keyword similarity detected ({similarity_ratio:.2f}): '{article_title[:50]}...'")
            return True

This first tier catches obvious duplicates without requiring an AI call, saving both time and cost. For more ambiguous cases, the system moves to the second tier:

Tier 2: AI-Powered Similarity Check

The system uses Gemini AI to compare the candidate article with recent posts, determining whether they cover the same news event. This approach is more nuanced than keyword matching but also more resource-intensive, which is why it is used as a second tier.

To optimize API costs, I:
1. Limit the number of posts to compare against (only the 15 most recent)
2. Truncate the article text for comparison (first 500 characters)
3. Simplify the prompt to focus only on determining similarity
4. Add proper error handling and timeouts for API calls
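
Under those constraints, the second tier might look roughly like the sketch below. The prompt wording, the YES/NO convention, and the fail-open behavior are illustrative, not the exact text in the repository; the Gemini call and its request_options timeout are standard google.generativeai usage.

def is_similar_by_ai(model, article_title, article_text, recent_posts):
    """Ask Gemini whether the candidate covers the same event as any recent post."""
    recent_titles = "\n".join(p.title for p in recent_posts[:15] if p.title)  # only the 15 most recent
    snippet = article_text[:500]  # truncate to keep token usage down

    prompt = (
        "Does the following article cover the same news event as any of these recent posts? "
        "Answer YES or NO only.\n\n"
        f"Recent posts:\n{recent_titles}\n\n"
        f"Article title: {article_title}\n"
        f"Article text: {snippet}"
    )
    try:
        response = model.generate_content(prompt, request_options={"timeout": 30})
        return response.text.strip().upper().startswith("YES")
    except Exception as e:
        logger.error(f"AI similarity check failed: {e}")
        return False  # fail open: treat as not similar rather than blocking the post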

Paywall Detection and Avoidance

Paywalled content creates a frustrating experience for readers who click a link only to find they can’t access the article. The system employs several strategies to detect and avoid paywalled content:

1. Known Paywall Domains: A predefined list of domains known to have strict paywalls:

known_paywall_domains = [
    'sfchronicle.com', 'wsj.com', 'nytimes.com',
    'ft.com', 'bloomberg.com', 'washingtonpost.com',
    'whitehouse.gov', 'treasury.gov', 'justice.gov'
]

2. Content Analysis: For other domains, the system fetches the article and analyzes the content:

# 'article' is a newspaper3k Article that has already been downloaded and parsed
if not article.text or len(article.text.split()) < 50:
    known_paywall_phrases = [
        "subscribe", "subscription", "sign in",
        "premium content", "premium article",
        "paid subscribers only"
    ]
    if any(phrase in article.html.lower() for phrase in known_paywall_phrases):
        raise PaywallError(f"Content appears to be behind a paywall: {url}")

This dual approach efficiently filters out paywalled content without excessive processing. It’s a balance between respecting publishers’ business models and ensuring a good experience for readers.
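
Putting the two strategies together, a combined check might look roughly like this sketch. It assumes the domain and phrase lists shown above are available at module level and that article is the parsed newspaper3k Article; the helper name is my own.

from urllib.parse import urlparse

def is_likely_paywalled(url, article):
    """Check the known-domain list first, then fall back to content analysis."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if any(domain.endswith(d) for d in known_paywall_domains):
        return True

    # Very short extracted text plus subscription language suggests a paywall
    if not article.text or len(article.text.split()) < 50:
        if any(phrase in article.html.lower() for phrase in known_paywall_phrases):
            return True
    return False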

Tweet Generation with Proper Hashtags

Creating concise, informative, and balanced summaries is crucial for a news aggregator. I leverage Gemini AI to generate summaries with specific guidance:

prompt = f"""Write a very balanced tweet in approximately 200 characters about this news. No personal comments. Just the facts. Also add one hashtag at the end of the tweet that isn't '#News'. No greater than 200 characters.  
News article text: {article_text}"""
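
Here is a sketch of how that prompt might produce the tweet_text and generated_hashtag values used in the facet code that follows. Extracting the trailing hashtag with a regular expression is my illustrative choice rather than the repository's exact approach.

import re

def generate_tweet(model, article_text):
    """Generate the summary, then split off the trailing hashtag for facet handling."""
    prompt = (
        "Write a very balanced tweet in approximately 200 characters about this news. "
        "No personal comments. Just the facts. Also add one hashtag at the end of the tweet "
        "that isn't '#News'. No greater than 200 characters.\n"
        f"News article text: {article_text}"
    )
    response = model.generate_content(prompt)
    text = response.text.strip()

    match = re.search(r"#(\w+)\s*$", text)  # trailing hashtag, if any
    generated_hashtag = match.group(1) if match else "News"
    tweet_text = text[:match.start()].rstrip() if match else text
    return tweet_text, generated_hashtag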

For Bluesky, proper hashtag formatting is essential. Unlike traditional social media platforms, Bluesky requires special handling of hashtags using “facets” – a way to add rich features to text. I implemented this with careful byte position calculation:

# Facet for the generated hashtag
generated_hashtag_start = len(tweet_text) + 1  # +1 for the space before hashtag
generated_hashtag_end = generated_hashtag_start + len(generated_hashtag) + 1  # +1 for the # symbol

facets.append(
    models.AppBskyRichtextFacet.Main(
        features=[
            models.AppBskyRichtextFacet.Tag(
                tag=generated_hashtag
            )
        ],
        index=models.AppBskyRichtextFacet.ByteSlice(
            byteStart=generated_hashtag_start,
            byteEnd=generated_hashtag_end
        )
    )
)
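
For completeness, posting the finished text with those facets might look like the following. This assumes the installed atproto SDK version's send_post accepts a facets argument and that the credentials are placeholders for real values.

from atproto import Client, models

client = Client()
client.login("handle.bsky.social", "app-password")  # placeholder credentials

post_text = f"{tweet_text} #{generated_hashtag}"  # matches the byte offsets computed above
client.send_post(text=post_text, facets=facets)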

For error handling, I implemented fallback mechanisms to ensure posts still go out even if the AI encounters issues:

1. A simplified tweet format with basic information if AI summary generation fails
2. Default hashtags when custom ones can’t be generated
3. Proper error logging for debugging and improvement
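
A minimal sketch of that fallback path, reusing the generate_tweet sketch from earlier and assuming the article title is always available even when summary generation fails:

def build_tweet_with_fallback(model, article):
    """Try the AI summary first; fall back to a plain title-based post on failure."""
    try:
        return generate_tweet(model, article.text)
    except Exception as e:
        logger.error(f"Summary generation failed, using fallback format: {e}")
        # Simplified tweet: just the headline, trimmed, with a default hashtag
        return article.title[:200], "News"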

Challenges and Solutions

URL Tracking and History Management

One of the early challenges was preventing the system from posting the same article multiple times. I implemented a URL history tracking system that stores previously posted URLs in a text file:

def _add_url_to_history(self, url: str):
    """Add a URL to the history file and clean up if needed."""
    try:
        # Get existing URLs
        urls = self._get_posted_urls()
        
        # Add new URL if it's not already in the list
        if url not in urls:
            urls.append(url)
        
        # Check if we need to clean up
        if len(urls) > self.max_history_lines:
            logger.info(f"URL history exceeds {self.max_history_lines} entries, removing oldest {self.cleanup_threshold}")
            urls = urls[self.cleanup_threshold:]
        
        # Write back to file
        with open(self.url_history_file, 'w') as f:
            for u in urls:
                f.write(f"{u}\n")
        
        logger.info(f"Added URL to history file: {url}")
    except Exception as e:
        logger.error(f"Error adding URL to history file: {e}")

To prevent the history file from growing indefinitely, I implemented a cleanup strategy:

1. Set a maximum number of URLs to store (100 in current implementation)
2. When this limit is exceeded, remove the oldest entries (10 at a time)
3. Only add URLs that aren’t already in the list

This approach ensures efficient storage while maintaining enough history to prevent duplicates.
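
The companion reader referenced in that method, self._get_posted_urls(), isn't shown in the snippet above; a minimal version might look like this (assuming os is imported in the module):

def _get_posted_urls(self) -> list:
    """Read previously posted URLs from the history file, one per line."""
    if not os.path.exists(self.url_history_file):
        return []
    try:
        with open(self.url_history_file, 'r') as f:
            return [line.strip() for line in f if line.strip()]
    except Exception as e:
        logger.error(f"Error reading URL history file: {e}")
        return []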

Browser Automation for URL Resolution

A technical challenge arose when dealing with Google News URLs, which are often redirect links rather than direct article URLs. To resolve these to their actual destinations, I implemented a Selenium-based browser automation solution:

def get_real_url(self, google_url: str) -> Optional[str]:
    """
    Get the real article URL from a Google News URL using Selenium.
    """
    driver = None
    try:
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--ignore-certificate-errors')
        chrome_options.add_argument('--ignore-ssl-errors')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--log-level=3')
        chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])

        service = Service(log_output=os.devnull)
        driver = webdriver.Chrome(options=chrome_options, service=service)
        
        driver.get(google_url)
        time.sleep(3)  # Allow redirect to complete
        return driver.current_url

    except Exception as e:
        logger.error(f"Error extracting real URL: {e}")
        return None
        
    finally:
        if driver:
            driver.quit()

I optimized this process by:

1. Using headless mode to avoid opening visible browser windows
2. Disabling unnecessary features to improve performance
3. Implementing proper error handling and resource cleanup
4. Suppressing verbose logs to keep the application output clean

Optimizing API Usage

AI API calls can be expensive, so optimizing their usage was crucial for sustainability. I implemented several strategies:

1. Model Selection Based on Cost-Effectiveness:

preferred_models = [
    # Prioritize more cost-effective models first
    'gemini-1.5-flash',  # Good balance of capability and cost
    'gemini-flash',      # If available, even more cost-effective
    'gemini-1.0-pro',    # Fallback to older model
    'gemini-pro',        # Another fallback
    # Only use the most expensive models as last resort
    'gemini-1.5-pro',
    'gemini-1.5-pro-latest',
    'gemini-2.0-pro-exp'
]

2. Tiered Approaches for Expensive Operations: Using cheaper methods first (like keyword matching) before resorting to AI calls

3. Prompt Optimization: Crafting concise prompts that minimize token usage while maintaining effectiveness

4. Robust Error Handling and Fallbacks: Ensuring the system can continue operating even if AI services fail or return unexpected results
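
The preferred-model list only pays off when it is paired with a fallback loop. Here is a rough sketch of how that might be done with the google.generativeai client; the trivial test prompt is my own illustrative check, and the API key is assumed to be configured already.

import google.generativeai as genai

def get_working_model(preferred_models):
    """Return the first preferred model that is available and responds."""
    available = {m.name.split('/')[-1] for m in genai.list_models()}
    for name in preferred_models:
        if name not in available:
            continue
        try:
            model = genai.GenerativeModel(name)
            model.generate_content("ping")  # cheap sanity check
            return model
        except Exception as e:
            logger.warning(f"Model {name} unavailable, trying next: {e}")
    raise RuntimeError("No usable Gemini model found")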

These optimizations reduced API costs significantly while maintaining the quality of the system’s output.

Results and Impact

The automated news aggregator has been successfully running for several months, posting high-quality news updates to Bluesky. Here are some key observations:

1. Consistent Post Quality: The AI selection process has proven effective at identifying genuinely newsworthy content, with minimal instances of clickbait or low-value articles slipping through.

2. User Engagement: Posts from the system consistently receive more engagement than similar manual posts, likely due to the balanced tone and focus on substantive news.

3. Reliability: The error handling and retry mechanisms have ensured near-perfect uptime, with the system recovering gracefully from temporary issues with news sources or API services.

4. Content Diversity: The randomization in the selection process has resulted in a good mix of topics rather than focusing exclusively on mainstream headlines.

One interesting observation has been the performance of different news categories. Technology and science articles tend to receive the most engagement, while political news generates more replies but fewer reposts.

Future Improvements

While the current system is functioning well, several potential improvements could enhance its capabilities:

1. Enhanced Content Categories: Implementing topic classification to ensure a balanced mix of news categories (politics, technology, science, etc.)

2. User Feedback Integration: Creating a mechanism for followers to influence content selection through engagement metrics

3. Multi-Platform Support: Extending the system to post to additional platforms beyond Bluesky

4. Custom Image Generation: Using AI to generate relevant images for posts when the original article doesn’t provide suitable visuals

5. Sentiment Analysis: Adding sentiment analysis to ensure balanced coverage of both positive and negative news

Technical Lessons Learned

This project offered several valuable technical insights:

1. Prompt Engineering is Crucial: The quality of AI outputs depends heavily on how questions are framed. Specific, clear prompts with examples yield better results than vague instructions.

2. Cost Optimization Requires Layered Approaches: Using a combination of simple heuristics and AI creates the best balance between cost and quality.

3. Error Handling is Non-Negotiable: When working with multiple external services (news sites, AI APIs, social platforms), robust error handling is essential for reliability.

4. Modular Design Pays Off: The decision to create separate, focused components made debugging and enhancement much easier as the project evolved.

5. Data Persistence Matters: Even simple persistence mechanisms (like the URL history file) significantly improve system behavior over time.

Conclusion

Building an AI-powered news aggregator for Bluesky has been a fascinating journey at the intersection of artificial intelligence, content curation, and social media. The system demonstrates how AI can enhance content selection and creation while still requiring thoughtful human design and oversight.

The project shows the potential for automated systems to deliver high-quality content that respects both readers’ time and publishers’ content. By focusing on genuine newsworthiness rather than clickbait metrics, the system contributes positively to the information ecosystem.

I’m continuing to refine and enhance the system based on user feedback and performance metrics. If you’re interested in building something similar or have suggestions for improvement, I’d love to hear from you in the comments.

Resources

GitHub Repository: https://github.com/Eric-Ness/NewsAnalysisBSkyPoster
AT Protocol Documentation: https://atproto.com/docs
Google Gemini AI: https://ai.google.dev/
Newspaper3k Documentation: https://newspaper.readthedocs.io/


