https://github.com/projectdiscovery/katana
# katana

A next-generation crawling and spidering framework.

Features • Installation • Usage • Scope • Config • Filters • Join Discord

## Features

* Fast and fully configurable web crawling
* Standard and headless mode support
* JavaScript parsing / crawling
* Customizable automatic form filling
* Scope control - preconfigured fields / regex
* Customizable output - preconfigured fields
* INPUT - STDIN, URL and LIST
* OUTPUT - STDOUT, FILE and JSON

## Installation

katana requires Go 1.18 to install successfully. To install, run the command below or download a pre-compiled binary from the release page.

```console
go install github.com/projectdiscovery/katana/cmd/katana@latest
```

## Usage

```console
katana -h
```

This will display help for the tool. Here are all the switches it supports.

```console
Usage:
  ./katana [flags]

Flags:
INPUT:
   -u, -list string[]  target url / list to crawl

CONFIGURATION:
   -d, -depth int                maximum depth to crawl (default 2)
   -jc, -js-crawl                enable endpoint parsing / crawling in javascript file
   -ct, -crawl-duration int      maximum duration to crawl the target for
   -kf, -known-files string      enable crawling of known files (all,robotstxt,sitemapxml)
   -mrs, -max-response-size int  maximum response size to read (default 2097152)
   -timeout int                  time to wait for request in seconds (default 10)
   -aff, -automatic-form-fill    enable optional automatic form filling (experimental)
   -retry int                    number of times to retry the request (default 1)
   -proxy string                 http/socks5 proxy to use
   -H, -headers string[]         custom header/cookie to include in request
   -config string                path to the katana configuration file
   -fc, -form-config string      path to custom form configuration file

HEADLESS:
   -hl, -headless       enable headless hybrid crawling (experimental)
   -sc, -system-chrome  use local installed chrome browser instead of katana installed
   -sb, -show-browser   show the browser on the screen with headless mode

SCOPE:
   -cs, -crawl-scope string[]       in scope url regex to be followed by crawler
   -cos, -crawl-out-scope string[]  out of scope url regex to be excluded by crawler
   -fs, -field-scope string         pre-defined scope field (dn,rdn,fqdn) (default "rdn")
   -ns, -no-scope                   disables host based default scope
   -do, -display-out-scope          display external endpoint from scoped crawling

FILTER:
   -f, -field string                field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
   -sf, -store-field string         field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
   -em, -extension-match string[]   match output for given extension (eg, -em php,html,js)
   -ef, -extension-filter string[]  filter output for given extension (eg, -ef png,css)

RATE-LIMIT:
   -c, -concurrency int          number of concurrent fetchers to use (default 10)
   -p, -parallelism int          number of concurrent inputs to process (default 10)
   -rd, -delay int               request delay between each request in seconds
   -rl, -rate-limit int          maximum requests to send per second (default 150)
   -rlm, -rate-limit-minute int  maximum number of requests to send per minute

OUTPUT:
   -o, -output string  file to write output to
   -j, -json           write output in JSONL(ines) format
   -nc, -no-color      disable output content coloring (ANSI escape codes)
   -silent             display output only
   -v, -verbose        display verbose output
   -version            display project version
```

## Running Katana

### Input for katana

katana requires a URL or endpoint to crawl and accepts single or multiple inputs. An input URL can be provided with the -u option, and multiple values can be passed as comma-separated input. File input is supported with the -list option, and piped (STDIN) input is also supported.

#### URL Input

```console
katana -u https://tesla.com
```

#### Multiple URL Input (comma-separated)

```console
katana -u https://tesla.com,https://google.com
```

#### List Input

```console
$ cat url_list.txt

https://tesla.com
https://google.com
```

```console
katana -list url_list.txt
```

#### STDIN (piped) Input

```console
echo https://tesla.com | katana
```

```console
cat domains | httpx | katana
```
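Because katana accepts piped input, it can slot into a larger reconnaissance pipeline. A minimal sketch, assuming projectdiscovery's subfinder and httpx are also installed (this chain is illustrative and not part of katana itself):

```console
# illustrative: enumerate subdomains, probe for live hosts, then crawl them
subfinder -d tesla.com -silent | httpx -silent | katana -o endpoints.txt
```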
Example of running katana -

```console
katana -u https://youtube.com

   __        __
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1

                projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://www.youtube.com/
https://www.youtube.com/about/
https://www.youtube.com/about/press/
https://www.youtube.com/about/copyright/
https://www.youtube.com/t/contact_us/
https://www.youtube.com/creators/
https://www.youtube.com/ads/
https://www.youtube.com/t/terms
https://www.youtube.com/t/privacy
https://www.youtube.com/about/policies/
https://www.youtube.com/howyoutubeworks?utm_campaign=ytgen&utm_source=ythp&utm_medium=LeftNav&utm_content=txt&u=https%3A%2F%2Fwww.youtube.com%2Fhowyoutubeworks%3Futm_source%3Dythp%26utm_medium%3DLeftNav%26utm_campaign%3Dytgen
https://www.youtube.com/new
https://m.youtube.com/
https://www.youtube.com/s/desktop/4965577f/jsbin/desktop_polymer.vflset/desktop_polymer.js
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-home-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/cssbin/www-onepick.css
https://www.youtube.com/s/_/ytmainappweb/_/ss/k=ytmainappweb.kevlar_base.0Zo5FUcPkCg.L.B1.O/am=gAE/d=0/rs=AGKMywG5nh5Qp-BGPbOaI1evhF5BVGRZGA
https://www.youtube.com/opensearch?locale=en_GB
https://www.youtube.com/manifest.webmanifest
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-watch-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/jsbin/web-animations-next-lite.min.vflset/web-animations-next-lite.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/custom-elements-es5-adapter.vflset/custom-elements-es5-adapter.js
https://www.youtube.com/s/desktop/4965577f/jsbin/webcomponents-sd.vflset/webcomponents-sd.js
https://www.youtube.com/s/desktop/4965577f/jsbin/intersection-observer.min.vflset/intersection-observer.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/scheduler.vflset/scheduler.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-i18n-constants-en_GB.vflset/www-i18n-constants.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-tampering.vflset/www-tampering.js
https://www.youtube.com/s/desktop/4965577f/jsbin/spf.vflset/spf.js
https://www.youtube.com/s/desktop/4965577f/jsbin/network.vflset/network.js
https://www.youtube.com/howyoutubeworks/
https://www.youtube.com/trends/
https://www.youtube.com/jobs/
https://www.youtube.com/kids/
```

## Crawling Mode

### Standard Mode

Standard crawling mode uses the standard Go http library under the hood to handle HTTP requests/responses. This mode is much faster because it carries no browser overhead. However, it analyzes HTTP response bodies as-is, without any JavaScript or DOM rendering, so it can miss endpoints that only appear after DOM rendering or asynchronous endpoint calls triggered by browser-specific events in complex web applications.

### Headless Mode

Headless mode hooks internal headless calls to handle HTTP requests/responses directly within the browser context. This offers two advantages:

* The HTTP fingerprint (TLS and user agent) fully identifies the client as a legitimate browser
* Better coverage, since endpoints are discovered by analyzing both the standard raw response, as in the previous mode, and the browser-rendered response with JavaScript enabled

Headless crawling is optional and can be enabled using the -headless option.

Here are other headless CLI options -

```console
katana -h headless

Flags:
HEADLESS:
   -hl, -headless       enable experimental headless hybrid crawling
   -sc, -system-chrome  use local installed chrome browser instead of katana installed
   -sb, -show-browser   show the browser on the screen with headless mode
```

## Scope Control

Crawling can be endless if not scoped; katana therefore ships with multiple ways to define the crawl scope.

### -field-scope

The handiest option is defining scope with a predefined field name, rdn being the default for field scope.

* rdn - crawling scoped to root domain name and all subdomains (default)
* fqdn - crawling scoped to the given sub(domain)
* dn - crawling scoped to domain name keyword

```console
katana -u https://tesla.com -fs dn
```

### -crawl-scope

For advanced scope control, the -cs option can be used, which comes with regex support.

```console
katana -u https://tesla.com -cs login
```

For multiple in-scope rules, file input with multiline strings / regexes can be passed.

```console
$ cat in_scope.txt

login/
admin/
app/
wordpress/
```

```console
katana -u https://tesla.com -cs in_scope.txt
```

### -crawl-out-scope

For defining what not to crawl, the -cos option can be used; it also supports regex input.

```console
katana -u https://tesla.com -cos logout
```

For multiple out-of-scope rules, file input with multiline strings / regexes can be passed.

```console
$ cat out_of_scope.txt

/logout
/log_out
```

```console
katana -u https://tesla.com -cos out_of_scope.txt
```

### -no-scope

By default, katana scopes the crawl to *.domain; the -ns option disables this, for example to crawl the internet.

```console
katana -u https://tesla.com -ns
```

### -display-out-scope

By default, when a scope option is used it also applies to the links displayed as output, so external URLs are excluded. To override this behavior, the -do option can be used to display all external URLs found on in-scope URLs / endpoints.

```console
katana -u https://tesla.com -do
```

Here are all the CLI options for scope control -

```console
katana -h scope

Flags:
SCOPE:
   -cs, -crawl-scope string[]       in scope url regex to be followed by crawler
   -cos, -crawl-out-scope string[]  out of scope url regex to be excluded by crawler
   -fs, -field-scope string         pre-defined scope field (dn,rdn,fqdn) (default "rdn")
   -ns, -no-scope                   disables host based default scope
   -do, -display-out-scope          display external endpoint from scoped crawling
```
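Scope options can also be combined. A minimal sketch restricting the crawl to the exact target host while excluding session-ending endpoints (the regex is illustrative, not from this README):

```console
# illustrative: stay on the given fqdn and skip logout-style endpoints
katana -u https://tesla.com -fs fqdn -cos "logout|sign_out"
```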
## Crawler Configuration

Katana comes with multiple options to configure and control the crawl the way we want.

### -depth

Option to define the depth to follow URLs while crawling; the greater the depth, the more endpoints crawled and the longer the crawl takes.

```console
katana -u https://tesla.com -d 5
```

### -js-crawl

Option to enable JavaScript file parsing and crawling of the endpoints discovered in JavaScript files; disabled by default.

```console
katana -u https://tesla.com -jc
```

### -crawl-duration

Option to predefine the crawl duration; disabled by default.

```console
katana -u https://tesla.com -ct 2
```

### -known-files

Option to enable crawling of robots.txt and sitemap.xml files; disabled by default.

```console
katana -u https://tesla.com -kf robotstxt,sitemapxml
```

### -automatic-form-fill

Option to enable automatic form filling for known / unknown fields; known field values can be customized as needed by updating the form config file at $HOME/.config/katana/form-config.yaml. Automatic form filling is an experimental feature.

```console
   -aff, -automatic-form-fill  enable optional automatic form filling (experimental)
```

There are more options to configure when needed; here are all the config-related CLI options -

```console
katana -h config

Flags:
CONFIGURATION:
   -d, -depth int                maximum depth to crawl (default 2)
   -jc, -js-crawl                enable endpoint parsing / crawling in javascript file
   -ct, -crawl-duration int      maximum duration to crawl the target for
   -kf, -known-files string      enable crawling of known files (all,robotstxt,sitemapxml)
   -mrs, -max-response-size int  maximum response size to read (default 2097152)
   -timeout int                  time to wait for request in seconds (default 10)
   -retry int                    number of times to retry the request (default 1)
   -proxy string                 http/socks5 proxy to use
   -H, -headers string[]         custom header/cookie to include in request
   -config string                path to the katana configuration file
   -fc, -form-config string      path to custom form configuration file
```

## Filters

### -field

Katana comes with built-in fields that can be used to filter the output for the desired information; the -f option can be used to specify any of the available fields.

```console
   -f, -field string  field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
```

Here is a table with examples of each field and the expected output when used -

| FIELD | DESCRIPTION                 | EXAMPLE                                                                   |
|-------|-----------------------------|---------------------------------------------------------------------------|
| url   | URL endpoint                | https://admin.projectdiscovery.io/admin/login?user=admin&password=admin    |
| qurl  | URL including query param   | https://admin.projectdiscovery.io/admin/login.php?user=admin&password=admin |
| qpath | Path including query param  | /login?user=admin&password=admin                                           |
| path  | URL path                    | https://admin.projectdiscovery.io/admin/login                              |
| fqdn  | Fully qualified domain name | admin.projectdiscovery.io                                                  |
| rdn   | Root domain name            | projectdiscovery.io                                                        |
| rurl  | Root URL                    | https://admin.projectdiscovery.io                                          |
| file  | Filename in URL             | login.php                                                                  |
| key   | Parameter keys in URL       | user,password                                                              |
| value | Parameter values in URL     | admin,admin                                                                |
| kv    | Keys=Values in URL          | user=admin&password=admin                                                  |
| dir   | URL directory name          | /admin/                                                                    |
| udir  | URL with directory          | https://admin.projectdiscovery.io/admin/                                   |

Here is an example of using the field option to display only the URLs with query parameters -

```console
katana -u https://tesla.com -f qurl -silent

https://shop.tesla.com/en_au?redirect=no
https://shop.tesla.com/en_nz?redirect=no
https://shop.tesla.com/product/men_s-raven-lightweight-zip-up-bomber-jacket?sku=1740250-00-A
https://shop.tesla.com/product/tesla-shop-gift-card?sku=1767247-00-A
https://shop.tesla.com/product/men_s-chill-crew-neck-sweatshirt?sku=1740176-00-A
https://www.tesla.com/about?redirect=no
https://www.tesla.com/about/legal?redirect=no
https://www.tesla.com/findus/list?redirect=no
```

### -store-field

To complement the field option, which filters output at run time, there is the -sf, -store-field option, which works exactly like the field option except that instead of filtering it stores all the information on disk under the katana_output directory, sorted by target URL.

```console
katana -u https://tesla.com -sf key,fqdn,qurl -silent
```

```console
$ ls katana_output/

https_www.tesla.com_fqdn.txt
https_www.tesla.com_key.txt
https_www.tesla.com_qurl.txt
```

Note: the store-field option can come in handy for collecting information to build a target-aware wordlist for things like (but not limited to) -

* Most / commonly used parameters
* Most / commonly used paths
* Most / commonly used files
* Related / unknown sub(domains)

Here are additional filter options -

```console
   -f, -field string                field to display in output (url,path,fqdn,rdn,rurl,qurl,file,key,value,kv,dir,udir)
   -sf, -store-field string         field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,file,key,value,kv,dir,udir)
   -em, -extension-match string[]   match output for given extension (eg, -em php,html,js)
   -ef, -extension-filter string[]  filter output for given extension (eg, -ef png,css)
```
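As a concrete instance of the wordlist idea above, field output can be piped through standard shell tools. A minimal sketch (the tr/sort pipeline is illustrative; katana itself only supplies the comma-separated key field):

```console
# illustrative: collect unique parameter names for a target-aware wordlist
katana -u https://tesla.com -f key -silent | tr ',' '\n' | sort -u > params.txt
```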
## Rate Limit & Delay

It's easy to get blocked / banned while crawling if you don't respect the target website's limits; katana comes with multiple options to tune the crawl to go as fast or slow as we want.

### -delay

Option to introduce a delay in seconds between each new request katana makes while crawling; disabled by default.

```console
katana -u https://tesla.com -delay 20
```

### -concurrency

Option to control the number of URLs per target to fetch at the same time.

```console
katana -u https://tesla.com -c 20
```

### -parallelism

Option to define the number of targets to process at the same time from list input.

```console
katana -u https://tesla.com -p 20
```

### -rate-limit

Option to define the maximum number of requests that can go out per second.

```console
katana -u https://tesla.com -rl 100
```

### -rate-limit-minute

Option to define the maximum number of requests that can go out per minute.

```console
katana -u https://tesla.com -rlm 500
```

Here are all the long / short CLI options for rate limit control -

```console
katana -h rate-limit

Flags:
RATE-LIMIT:
   -c, -concurrency int          number of concurrent fetchers to use (default 10)
   -p, -parallelism int          number of concurrent inputs to process (default 10)
   -rd, -delay int               request delay between each request in seconds
   -rl, -rate-limit int          maximum requests to send per second (default 150)
   -rlm, -rate-limit-minute int  maximum number of requests to send per minute
```

## Output

### -json

Katana supports file output in plain-text format as well as JSON, which includes additional information like source, tag, and attribute name to correlate the discovered endpoint.

```console
katana -u https://example.com -json -do | jq .
```

```json
{
  "timestamp": "2022-11-05T22:33:27.745815+05:30",
  "endpoint": "https://www.iana.org/domains/example",
  "source": "https://example.com",
  "tag": "a",
  "attribute": "href"
}
```

Here are additional CLI options related to output -

```console
katana -h output

OUTPUT:
   -o, -output string  file to write output to
   -j, -json           write output in JSONL(ines) format
   -nc, -no-color      disable output content coloring (ANSI escape codes)
   -silent             display output only
   -v, -verbose        display verbose output
   -version            display project version
```
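Because each JSONL record carries the source tag and attribute, the output can be sliced further with jq. A minimal sketch using the field names from the record above (the select filter itself is illustrative):

```console
# illustrative: list only endpoints discovered from <script> tags
katana -u https://example.com -json | jq -r 'select(.tag=="script") | .endpoint'
```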
---

katana is made with ❤️ by the projectdiscovery team and distributed under MIT License.

Join Discord