TLDR: I Did Download It
And now I can analyze it with DuckDB. Behold the fraction of total comments and stories referencing
key topics over time!
Part 1: The Mods Are Asleep, Download It All
As part of building hn.unlurker.com, I wrote an
HN API client. There are already a bunch of other
clients, but I wanted to try the latest Go features and linters on a new project. I’m glad I did; it
was a lot of fun.
The client can retrieve active items, lists of items, etc. (comments and stories are called “items”
in the HN API). Although I only really needed recent items for
my project, for completeness I added “scan”, which downloads all the items in order, from zero to
the latest or the other way around.
I wondered — could I just download the whole thing? Extrapolating from a few thousand items, it
would only be tens of GiB of JSON. I thought I’d give it a try.
hn scan --no-cache --asc -c- -o full.json
I had to CTRL-C a stalled download a few times, but scan is resumable, so after a few hours I was
done. I had a 20 GiB JSON file of everything that has ever happened on Hacker News, and I can just
re-run the command above to “top it off” any time I need the latest. But what could I do with it?
Part 2: Feed The Ducks
First I just grepped for things. How many times has the phrase
“correct horse battery staple” appeared on the
site? Quite a few: 231 times (the last one
just today). But grepping stuff is old news, so I
thought I’d try out DuckDB.
In the database world, DuckDB is unique: a super-fast embeddable analytics execution engine also
available as a command-line tool. I spend most of my day wrangling a
different database (there’s the plug my coworkers will be looking
for), but I’ve been meaning to try DuckDB, and it seemed perfect for this one-off task.
As it turns out, with their new UI it’s a breeze for novices like me to use. AND LLMs are pretty
good at helping craft the SQL queries. I just had to import the data:
-- format='nd' reads newline-delimited JSON (one item per line);
-- sample_size=-1 scans every record when inferring the schema.
CREATE TABLE items AS
SELECT *
FROM read_json_auto('/home/jason/full.json', format='nd', sample_size=-1);
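As a quick sanity check (and to roughly reproduce the earlier grep), a query along these lines should work; note it counts items whose text mentions the phrase, which can differ slightly from a raw grep match count:
SELECT COUNT(*) AS mentions
FROM items
WHERE text ILIKE '%correct horse battery staple%';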
With the data loaded, I could query it. Here’s a 12-week moving average of the fraction of total
items containing the terms I’m interested in:
WITH weekly AS (
    SELECT
        DATE_TRUNC('week', TO_TIMESTAMP(time)) AS week_start,
        COUNT(*) FILTER (WHERE text ILIKE '%python%')::float / NULLIF(COUNT(*), 0)
            AS python_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%javascript%')::float / NULLIF(COUNT(*), 0)
            AS javascript_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%java%')::float / NULLIF(COUNT(*), 0)
            AS java_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%ruby%')::float / NULLIF(COUNT(*), 0)
            AS ruby_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%rust%')::float / NULLIF(COUNT(*), 0)
            AS rust_prop
    FROM items
    GROUP BY week_start
)
SELECT
    week_start,
    AVG(python_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_python_12w,
    AVG(javascript_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_javascript_12w,
    AVG(java_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_java_12w,
    AVG(ruby_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_ruby_12w,
    AVG(rust_prop) OVER (
        ORDER BY week_start
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS avg_rust_12w
FROM weekly
ORDER BY week_start;
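To get the numbers into a charting tool, DuckDB can also write query results straight to CSV. Here’s a minimal sketch for a single term (the filename rust_weekly.csv is just an example):
COPY (
    SELECT
        DATE_TRUNC('week', TO_TIMESTAMP(time)) AS week_start,
        -- share of that week's items whose text mentions "rust"
        COUNT(*) FILTER (WHERE text ILIKE '%rust%')::float / NULLIF(COUNT(*), 0) AS rust_prop
    FROM items
    GROUP BY week_start
    ORDER BY week_start
) TO 'rust_weekly.csv' (HEADER, DELIMITER ',');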
Overall DuckDB seems really great for analyzing data sets of this size.
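And because the scan is resumable, keeping the chart current is cheap: re-run the hn scan command from Part 1 to top off full.json, then rebuild the table with the same import (a sketch, assuming the same file path as above):
-- Rebuild the items table from the refreshed newline-delimited JSON dump.
CREATE OR REPLACE TABLE items AS
SELECT *
FROM read_json_auto('/home/jason/full.json', format='nd', sample_size=-1);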
Next Steps
Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots
on it and run them as contributors, slowly and inevitably replacing all human text with the output
of a Chinese room oscillator perpetually echoing and recycling the past.
Or alternatively, I think for this project I am done. Someone else will have to take it to the next
logical step.
Thanks for reading! Please check out hn.unlurker.com, take a look at my
other articles, or find me on X.