TLDR: I Did Download It
And now I can analyze it with DuckDB. Behold the fraction of total comments and stories referencing
key topics over time!
Part 1: The Mods Are Asleep, Download It All
As part of building hn.unlurker.com, I wrote an
HN API client. There are already a bunch of other
clients, but I wanted to try the latest Go features and linters on a new project. I’m glad I did; it
was a lot of fun.
The client can retrieve active items, lists of items, etc. (comments and stories are called “items”
in the HN API). Although I only really needed recent items for
my project, for completeness I added “scan”, which downloads all the items in order, from zero to
the latest or the other way around.
I wondered — could I just download the whole thing? Extrapolating from a few thousand items, it
would only be tens of GiB of JSON. I thought I’d give it a try.
hn scan --no-cache --asc -c- -o full.json
I had to CTRL-C a stalled download a few times, but scan is resumable, so after a few hours I was
done. I had a 20 GiB JSON file of everything that has ever happened on Hacker News, and I can just
re-run the command above to “top it off” any time I need the latest. But what could I do with it?
Part 2: Feed The Ducks
First I just grepped for things. How many times has the phrase
“correct horse battery staple” appeared on the
site? Quite a few: 231 times (the last one
just today). But grepping stuff is old news, so I
thought I’d try out DuckDB.
In the database world, DuckDB is unique: a super-fast embeddable analytics execution engine also
available as a command-line tool. I spend most of my day wrangling a
different database (there’s the plug my coworkers will be looking
for), but I’ve been meaning to try DuckDB, and it seemed perfect for this one-off task.
As it turns out, with their new UI it’s a breeze for novices like me to use. AND LLMs are pretty
good at helping craft the SQL queries. I just had to import the data:
-- format='nd' reads newline-delimited JSON (one item per line);
-- sample_size=-1 scans every record when inferring the schema.
CREATE TABLE items AS
SELECT *
FROM read_json_auto('/home/jason/full.json', format='nd', sample_size=-1);
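As a quick sanity check (and to roughly reproduce the earlier grep), a query along these lines should work; note it counts items whose text mentions the phrase, which can differ slightly from a raw grep match count:
SELECT COUNT(*) AS mentions
FROM items
WHERE text ILIKE '%correct horse battery staple%';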
With the data loaded, I could query it. Here’s a 12-week moving average of the fraction of total
items containing the terms I’m interested in:
WITH weekly AS (
    SELECT
        DATE_TRUNC('week', TO_TIMESTAMP(time)) AS week_start,
        COUNT(*) FILTER (WHERE text ILIKE '%python%')::float / NULLIF(COUNT(*), 0)
            AS python_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%javascript%')::float / NULLIF(COUNT(*), 0)
            AS javascript_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%java%')::float / NULLIF(COUNT(*), 0)
            AS java_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%ruby%')::float / NULLIF(COUNT(*), 0)
            AS ruby_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%rust%')::float / NULLIF(COUNT(*), 0)
            AS rust_prop
    FROM items
    GROUP BY week_start
)
SELECT
    week_start,
    AVG(python_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_python_12w,
    AVG(javascript_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_javascript_12w,
    AVG(java_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_java_12w,
    AVG(ruby_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_ruby_12w,
    AVG(rust_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_rust_12w
FROM weekly
ORDER BY week_start;
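To get the numbers into a charting tool, DuckDB can also write query results straight to CSV. Here’s a minimal sketch for a single term (the filename rust_weekly.csv is just an example):
COPY (
    SELECT
        DATE_TRUNC('week', TO_TIMESTAMP(time)) AS week_start,
        -- share of that week's items whose text mentions "rust"
        COUNT(*) FILTER (WHERE text ILIKE '%rust%')::float / NULLIF(COUNT(*), 0) AS rust_prop
    FROM items
    GROUP BY week_start
    ORDER BY week_start
) TO 'rust_weekly.csv' (HEADER, DELIMITER ',');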
Overall DuckDB seems really great for analyzing data sets of this size.
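And because the scan is resumable, keeping the chart current is cheap: re-run the hn scan command from Part 1 to top off full.json, then rebuild the table with the same import (a sketch, assuming the same file path as above):
-- Rebuild the items table from the refreshed newline-delimited JSON dump.
CREATE OR REPLACE TABLE items AS
SELECT *
FROM read_json_auto('/home/jason/full.json', format='nd', sample_size=-1);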
Next Steps
Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots
on it and run them as contributors, slowly and inevitably replacing all human text with the output
of a Chinese room oscillator perpetually echoing and recycling the past.
Or alternatively, I think for this project I am done. Someone else will have to take it to the next
logical step.
Thanks for reading! Please check out hn.unlurker.com, take a look at my
other articles, or find me on X.