[HN Gopher] Show HN: 4B+ DNS Records Dataset
       ___________________________________________________________________
        
       Show HN: 4B+ DNS Records Dataset
        
       Hi HN,

       I've been working on building a pipeline to create a DNS records
       database lately. The goal is to enable research as well as
       competitive landscape analysis on the internet.

       The dataset for now spans around 4 billion records and covers all
       the common DNS record types:

       A, AAAA, ANAME, CAA, CNAME, HINFO, HTTPS, MX, NAPTR, NS, PTR, SOA,
       SRV, SSHFP, SVCB, TLSA, TXT

       Each line in the CSV file represents a single DNS record in the
       following format:

       www.example.com,A,93.184.215.14

       Let me know if you have any questions or feedback!
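
A minimal sketch of reading one line in the format shown above, in Python. The DnsRecord type and the parse_line helper are hypothetical names used only for illustration. The post does not say how values containing commas (possible in TXT data) are quoted, so this assumes only the first two commas are field separators.

```python
from typing import NamedTuple


class DnsRecord(NamedTuple):
    name: str
    rtype: str
    value: str


def parse_line(line: str) -> DnsRecord:
    """Split one line of the form name,TYPE,value into its three fields."""
    # Split on the first two commas only, so a value that itself contains
    # commas stays in one piece. This is an assumption about the file, not
    # something the post confirms.
    name, rtype, value = line.rstrip("\r\n").split(",", 2)
    return DnsRecord(name, rtype.upper(), value)


record = parse_line("www.example.com,A,93.184.215.14")
print(record.name, record.rtype, record.value)
```

If the real file uses CSV quoting for such values, the standard csv module would be the safer choice.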
        
       Author : Eikon
       Score  : 73 points
       Date   : 2024-10-15 19:26 UTC (1 days ago)
        
 (HTM) web link (www.merklemap.com)
 (TXT) w3m dump (www.merklemap.com)
        
       | genmud wrote:
       | Neat! How is this different than domaintools/farsight [1]?
       | 
       | Passive DNS [2] has been in my toolbox for 15+ years, and is
       | invaluable for security research / threat intelligence. Knowing
       | the historical resolutions of a name is so helpful in
       | investigations.
       | 
       | For anyone interested, they should check out the talk by one of
       | the DomainTools people [3] on how it can be utilized for
       | investigation.
       | 
       | Are you passively collecting this data, or actively querying for
       | these records?
       | 
       | [1] - https://www.domaintools.com/products/threat-intelligence-
       | fee...
       | 
       | [2] - https://www.circl.lu/services/passive-dns/
       | 
       | [3] - https://www.youtube.com/watch?v=oXmapqLkZd0
        
         | Eikon wrote:
         | From what I understand [1] is just tlds, not subdomains?
        
           | genmud wrote:
           | That would be incorrect, they get subdomains for passive dns
           | feeds.
        
             | Eikon wrote:
             | Ok, it'd be interesting to know how big their datasets are
             | compared to mine and how much they overlap.
        
         | lyu07282 wrote:
         | is this making use of letsencrypt as well? afaik all
         | letsencrypt signed certificates including all subdomains are
         | immediately public, which could be useful for security research
         | as well
        
           | Eikon wrote:
           | It's not about letsencrypt but certificate transparency which
           | works the same for all public CAs.
           | 
           | I wrote a documentation piece here:
           | 
           | https://www.merklemap.com/documentation/how-it-works
        
           | whalesalad wrote:
           | At first glance it looks like this data is generated via the
           | public certificate transparency log, so I would imagine the
           | answer is yes.
        
       | 35mm wrote:
       | How often is it updated?
       | 
       | Does it include expired domains?
        
         | Eikon wrote:
         | > How often is it updated?
         | 
         | I plan to do 2 releases a month for now; the goal is one a day.
         | 
         | > Does it include expired domains?
         | 
         | Yes.
        
           | mh- wrote:
           | This is fantastically valuable, especially if you can add the
           | first/last-seen as requested by another commenter. Thanks for
           | doing this.
        
             | Eikon wrote:
             | Thanks.
             | 
             | That's quite a fun project!
        
       | g-mork wrote:
       | Any possibility of adding (first seen, last seen) time stamps?
       | There is basically no good way to reconstruct the state of e.g.
       | SPF at a point in time from existing DNS data sets
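
One way the requested (first seen, last seen) columns could be derived, sketched in Python under the assumption that each release is a dated snapshot of (name, type, value) records. The update_seen and values_at helpers and the snapshot dates are hypothetical, and treating a record as present for the whole span between its first and last sighting is an approximation.

```python
from datetime import date


def update_seen(seen, snapshot_date, records):
    """Fold one snapshot into a {record: (first_seen, last_seen)} map."""
    for rec in records:
        first, last = seen.get(rec, (snapshot_date, snapshot_date))
        seen[rec] = (min(first, snapshot_date), max(last, snapshot_date))
    return seen


def values_at(seen, name, rtype, when):
    """Values whose first_seen..last_seen window covers the given date."""
    return sorted(
        value
        for (n, t, value), (first, last) in seen.items()
        if n == name and t == rtype and first <= when <= last
    )


# Three made-up snapshots in which the SPF policy changes in the last one.
seen = {}
update_seen(seen, date(2024, 10, 1), [("example.com", "TXT", "v=spf1 -all")])
update_seen(seen, date(2024, 10, 8), [("example.com", "TXT", "v=spf1 -all")])
update_seen(seen, date(2024, 10, 15), [("example.com", "TXT", "v=spf1 mx -all")])

print(values_at(seen, "example.com", "TXT", date(2024, 10, 5)))
```

A date that falls between the last sighting of the old value and the first sighting of the new one matches neither, which reflects that the snapshots alone cannot say when the change happened.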
        
         | Eikon wrote:
         | I could in future releases, yes.
        
       | T3RMINATED wrote:
       | Where do you get the data from? Does it include subdomains?
        
         | Eikon wrote:
         | Hi,
         | 
         | https://www.merklemap.com/documentation/how-it-works
         | 
         | Basically the same process as described there, but using that
         | data to perform DNS queries.
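
A rough sketch, in Python, of what "names from certificate transparency, then DNS queries" could look like. This is not the author's pipeline: the function names are hypothetical, stripping the "*." from wildcard certificate names is an assumption, and only A records are looked up, through the system resolver via socket.getaddrinfo.

```python
import socket


def candidate_names(ct_names):
    """Turn certificate names into a deduplicated stream of names to query."""
    seen = set()
    for raw in ct_names:
        name = raw.strip().lower().rstrip(".")
        if name.startswith("*."):
            # A wildcard entry cannot be queried literally; fall back to
            # the parent name (an assumption made for this sketch).
            name = name[2:]
        if name and name not in seen:
            seen.add(name)
            yield name


def resolve_a(name):
    """IPv4 addresses from the system resolver; empty list on failure."""
    try:
        infos = socket.getaddrinfo(
            name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def to_csv_rows(ct_names, resolver=resolve_a):
    """Yield name,A,address lines for every name that resolves."""
    for name in candidate_names(ct_names):
        for address in resolver(name):
            yield f"{name},A,{address}"
```

The resolver is passed in as a parameter so the same loop can be run against a stub, or against a DNS library that can query the other record types.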
        
       | nhggfu wrote:
       | great work OP.
        
         | Eikon wrote:
         | Thank you!
        
       | g48ywsJk6w48 wrote:
       | Thank you for the data set! The data is not always lowercase, so
       | it has some duplicates.
       | 
       | Also, you can avoid unnecessary data by analyzing CNAME records,
       | e.g. domain.tld CNAME www.domain.tld. Then you only need to keep
       | the records for either domain.tld or www.domain.tld.
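
The lowercase point can be sketched in Python as follows. DNS names compare case-insensitively, so lowercasing the owner name is safe; values are lowercased only for record types whose value is itself a host name, since TXT data, for instance, is case-sensitive. The normalize and dedupe helpers and the HOSTNAME_VALUED set are hypothetical and deliberately conservative.

```python
# Record types whose value is a host name and can be lowercased safely.
# The set is an assumption for this sketch, not an exhaustive list.
HOSTNAME_VALUED = {"CNAME", "NS", "PTR"}


def normalize(record):
    """Lowercase the case-insensitive parts of a (name, type, value) record."""
    name, rtype, value = record
    rtype = rtype.upper()
    if rtype in HOSTNAME_VALUED:
        value = value.lower()
    return (name.lower(), rtype, value)


def dedupe(records):
    """Drop records that are equal after normalization, keeping first-seen order."""
    return list(dict.fromkeys(normalize(r) for r in records))


rows = [
    ("WWW.Example.com", "A", "93.184.215.14"),
    ("www.example.com", "a", "93.184.215.14"),
    ("example.com", "TXT", "Hello"),
]
print(dedupe(rows))
```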
        
       | whalesalad wrote:
       | 211GB seems _very_ small. How is this generated?
        
         | Eikon wrote:
         | What makes you think it's small?
        
       | mobilio wrote:
       | Note that records can vary with geolocation-based routing.
       | 
       | This means that from country A I can get records X, but from
       | country B the records can be Y.
       | 
       | It would be great if you could add a new column to the CSV that
       | flags such variations - Y/N.
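
A minimal sketch of how such a Y/N column could be computed, assuming the same name and type have already been queried from several vantage points. The varies_by_location helper and the vantage labels are hypothetical. A "Y" only says the answers differed; round-robin or load-balanced answers can differ for reasons that have nothing to do with geography.

```python
def varies_by_location(answers_by_vantage):
    """Return "Y" if the vantage points saw different answer sets, else "N".

    answers_by_vantage maps a vantage label to the values returned there
    for one name and record type. Order within an answer is ignored.
    """
    distinct = {frozenset(values) for values in answers_by_vantage.values()}
    return "Y" if len(distinct) > 1 else "N"


print(varies_by_location({"country-a": ["192.0.2.1"], "country-b": ["198.51.100.7"]}))
```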
        
       | m3047 wrote:
       | I've worked in the industry at IID and Farsight. I am skeptical
       | of many claims made by IoC vendors.
       | 
       | You need timestamps, or first / last seen.
       | 
       | Records don't exist in a vacuum. They come in RRsets. They are
       | served (sometimes inconsistently) by different nameservers. Some
       | use cases care about this.
       | 
       | Records which don't resolve are also useful, especially for use
       | cases which amount to front-running. On any given day if the wind
       | was blowing the right direction .belkin could be one of the top
       | 10 non-resolving TLDs. If your data is any good, check under
       | .cisco for stuff which resolves to 127.0.53.53. ;-)
       | 
       | Information about provenance (where the data comes from) is
       | required for some use cases.
       | 
       | We shipped Farsight's DNSDB on one or more 1TB drives, depending
       | on what the customer was purchasing.
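
The RRset point can be illustrated with a short Python sketch: records that share an owner name and type belong together, so a flat one-record-per-line file can be regrouped after the fact. The group_rrsets helper is hypothetical, and the grouping cannot recover which nameserver served which answer, which the flat format does not carry.

```python
from collections import defaultdict


def group_rrsets(records):
    """Group (name, type, value) records into {(name, type): set_of_values}."""
    rrsets = defaultdict(set)
    for name, rtype, value in records:
        rrsets[(name.lower(), rtype.upper())].add(value)
    return dict(rrsets)


rows = [
    ("example.com", "NS", "a.iana-servers.net"),
    ("example.com", "NS", "b.iana-servers.net"),
    ("example.com", "A", "93.184.215.14"),
]
for key, values in group_rrsets(rows).items():
    print(key, sorted(values))
```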
        
       ___________________________________________________________________
       (page generated 2024-10-16 23:02 UTC)