[HN Gopher] Show HN: 4B+ DNS Records Dataset
___________________________________________________________________
Show HN: 4B+ DNS Records Dataset
Hi HN, I've been working on building a pipeline to create a DNS
records database lately. The goal is to enable research as well as
competitive landscape analysis on the internet. The dataset for
now spans around 4 billion records and covers all the common DNS
record types:

A AAAA ANAME CAA CNAME HINFO HTTPS MX NAPTR NS PTR SOA SRV SSHFP
SVCB TLSA TXT

Each line in the CSV file represents a single DNS record in the
following format:

www.example.com,A,93.184.215.14

Let me know if you have any questions or feedback!
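
A minimal Python sketch of reading the name,type,value format described in the post. It assumes the file uses standard CSV quoting (so values that contain commas, as some TXT records might, are quoted) and has no header row; the post confirms neither. `DnsRecord` and `parse_records` are illustrative names, not part of the dataset.

```python
import csv
import io
from typing import Iterator, NamedTuple


class DnsRecord(NamedTuple):
    name: str   # owner name, e.g. "www.example.com"
    rtype: str  # record type, e.g. "A"
    value: str  # record data, e.g. "93.184.215.14"


def parse_records(lines) -> Iterator[DnsRecord]:
    """Yield one DnsRecord per three-column CSV row; skip anything else."""
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # assumption: malformed rows are dropped, not repaired
        yield DnsRecord(*row)


# The sample line from the post, read from an in-memory file.
sample = io.StringIO("www.example.com,A,93.184.215.14\n")
records = list(parse_records(sample))
print(records[0])
```

Because `parse_records` is a generator, passing an open file handle instead of the `StringIO` object would stream the rows rather than load the whole file.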
Author : Eikon
Score : 73 points
Date   : 2024-10-15 19:26 UTC (1 day ago)
(HTM) web link (www.merklemap.com)
(TXT) w3m dump (www.merklemap.com)
| genmud wrote:
| Neat! How is this different than domaintools/farsight [1]?
|
| Passive DNS [2] has been in my toolbox for 15+ years, and is
| invaluable for security research / threat intelligence. Knowing
| the historical resolutions of a name is so helpful in
| investigations.
|
| For anyone interested, they should check out the talk by one of
| the DomainTools people [3] on how it can be utilized for
| investigation.
|
| Are you passively collecting this data, or actively querying for
| these records?
|
| [1] - https://www.domaintools.com/products/threat-intelligence-
| fee...
|
| [2] - https://www.circl.lu/services/passive-dns/
|
| [3] - https://www.youtube.com/watch?v=oXmapqLkZd0
| Eikon wrote:
| From what I understand, [1] covers just TLDs, not subdomains?
| genmud wrote:
| That would be incorrect, they get subdomains for passive dns
| feeds.
| Eikon wrote:
| Ok, it'd be interesting to know how big their datasets are
| compared to mine and how much they overlap.
| lyu07282 wrote:
| Is this making use of Let's Encrypt as well? AFAIK all
| certificates signed by Let's Encrypt, including all subdomains,
| are immediately public, which could be useful for security
| research as well.
| Eikon wrote:
| It's not about Let's Encrypt but Certificate Transparency, which
| works the same for all public CAs.
|
| I wrote a documentation piece here:
|
| https://www.merklemap.com/documentation/how-it-works
| whalesalad wrote:
| At first glance it looks like this data is generated via the
| public certificate transparency log, so I would imagine the
| answer is yes.
| 35mm wrote:
| How often is it updated?
|
| Does it include expired domains?
| Eikon wrote:
| > How often is it updated?
|
| I plan to do 2 releases a month for now, goal is one a day.
|
| > Does it include expired domains?
|
| Yes.
| mh- wrote:
| This is fantastically valuable, especially if you can add the
| first/last-seen as requested by another commenter. Thanks for
| doing this.
| Eikon wrote:
| Thanks.
|
| That's quite a fun project!
| g-mork wrote:
| Any possibility of adding (first seen, last seen) time stamps?
| There is basically no good way to reconstruct the state of e.g.
| SPF at a point in time from existing DNS data sets
| Eikon wrote:
| I could in future releases, yes.
| T3RMINATED wrote:
| Where do you get the data from? Does it include subdomains?
| Eikon wrote:
| Hi,
|
| https://www.merklemap.com/documentation/how-it-works
|
| Basically the same process here but using that data to perform
| DNS queries.
| nhggfu wrote:
| great work OP.
| Eikon wrote:
| Thank you!
| g48ywsJk6w48 wrote:
| Thank you for the data set! Names are not always lowercase, so it
| has some duplicates.
|
| You can also avoid unnecessary data by analyzing CNAME records,
| e.g. domain.tld CNAME www.domain.tld, so that you keep only the
| domain.tld or the www.domain.tld records.
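
The case-related duplicates mentioned above could be collapsed along these lines. This is a rough sketch rather than anything the dataset author ships: `normalize` and `dedupe` are illustrative names, only the owner name and the type are case-folded (record values are left untouched because some, such as TXT data, are case-sensitive), and an in-memory set would not scale to billions of rows, where an external sort would be the more realistic choice.

```python
def normalize(name: str) -> str:
    """Lowercase a DNS name and drop a trailing root dot."""
    return name.strip().rstrip(".").lower()


def dedupe(rows):
    """Yield each (name, type, value) row once, comparing names case-insensitively."""
    seen = set()
    for name, rtype, value in rows:
        key = (normalize(name), rtype.upper(), value)
        if key not in seen:
            seen.add(key)
            yield key


# Two spellings of the same record collapse into one row.
rows = [
    ("WWW.Example.com", "A", "93.184.215.14"),
    ("www.example.com", "A", "93.184.215.14"),
]
unique_rows = list(dedupe(rows))
print(unique_rows)
```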
| whalesalad wrote:
| 211GB seems _very_ small. How is this generated?
| Eikon wrote:
| What makes you think it's small?
| mobilio wrote:
| Note that records can be subject to geolocation routing.
|
| This means that from country A I can get records X, but from
| country B the records can be Y.
|
| It would be great if you could add a new column to the CSV that
| flags such variations - Y/N.
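
One way such a Y/N column might be computed, assuming the same name and type had been resolved from several vantage points. The thread does not say the pipeline does this, `varies_by_location` is a hypothetical helper, and the addresses are documentation-range placeholders. Differing answers can also come from round-robin or load balancing, so "Y" here only means "answers differed", not "geo-routed".

```python
def varies_by_location(answers_by_vantage: dict[str, set[str]]) -> str:
    """Return "Y" if any two vantage points saw different answer sets, else "N"."""
    distinct = {frozenset(answers) for answers in answers_by_vantage.values()}
    return "Y" if len(distinct) > 1 else "N"


# Hypothetical observations for one name/type pair from two countries.
observed = {
    "country-a": {"192.0.2.1"},
    "country-b": {"198.51.100.1"},
}
flag = varies_by_location(observed)
print(flag)
```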
| m3047 wrote:
| I've worked in the industry at IID and Farsight. I am skeptical
| of many claims made by IoC vendors.
|
| You need timestamps, or first / last seen.
|
| Records don't exist in a vacuum. They come in RRsets. They are
| served (sometimes inconsistently) by different nameservers. Some
| use cases care about this.
|
| Records which don't resolve are also useful, especially for use
| cases which amount to front-running. On any given day if the wind
| was blowing the right direction .belkin could be one of the top
| 10 non-resolving TLDs. If your data is any good, check under
| .cisco for stuff which resolves to 127.0.53.53. ;-)
|
| Information about provenance (where the data comes from) is
| required for some use cases.
|
| We shipped Farsight's DNSDB on one or more 1TB drives, depending
| on what the customer was purchasing.
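
To illustrate the RRset point, flat rows can be grouped by owner name and type. Each group then approximates one RRset, assuming every record is in class IN, since the CSV has no class column. The same grouping makes it easy to look for names that resolve to 127.0.53.53, the address ICANN designated for name-collision "controlled interruption". `group_rrsets` and `collision_names` are illustrative names, and the sample rows are invented placeholders, not data from the dataset.

```python
from collections import defaultdict

COLLISION_ADDR = "127.0.53.53"  # name-collision "controlled interruption" address


def group_rrsets(rows):
    """Map (owner name, type) to the set of values seen for that pair."""
    rrsets = defaultdict(set)
    for name, rtype, value in rows:
        rrsets[(name.lower(), rtype.upper())].add(value)
    return rrsets


def collision_names(rrsets):
    """Return the sorted owner names whose A RRset contains the collision address."""
    return sorted(
        name
        for (name, rtype), values in rrsets.items()
        if rtype == "A" and COLLISION_ADDR in values
    )


# Invented sample rows using reserved example names and addresses.
rows = [
    ("host.example", "A", "192.0.2.1"),
    ("host.example", "A", "192.0.2.2"),
    ("intranet.example", "A", "127.0.53.53"),
]
rrsets = group_rrsets(rows)
print(collision_names(rrsets))
```

This still does not capture which nameserver served each RRset or when it was seen, which the comment above also calls out as important for some use cases.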
___________________________________________________________________
(page generated 2024-10-16 23:02 UTC)