https://github.com/quickwit-oss/whichlang

quickwit-oss / whichlang

A blazingly fast and lightweight language detection library for Rust. License: MIT.

# Whichlang

This is a language detection library, aiming for both precision and performance.

You can read our blog post that introduces the algorithm behind Whichlang.

## Features

* No dependencies
* Throughput above 100 MB/s for short and long strings.
* Good accuracy (99.5% on my validation dataset, but it really depends on the size of your input.)

## How does it work?

It uses a multiclass logistic regression model over:

* 2, 3, 4-grams of letters on ASCII
* codepoint / 128
* a slightly smarter projection of codepoints over a given class.

We use the hashing trick and project these features over a space of size 4_096.

The logistic regression is trained in the attached Python notebook (`train.ipynb`), and used to generate weight.rs.

## Comparison with Whatlang

The following compares the throughput using the simple benchmark found in this repository and the accuracy using the whatlang-accuracy-benchmark benchmark.

Overall, Whichlang is at least 10x faster than Whatlang (about 12x on long strings and about 64x on short strings in the throughput table below) and more accurate (97.03% vs. 91.69% average accuracy).

### Throughput

To generate the throughput benchmark, we ported the benchmark available in this repository. Please check this repository to see our changes.

| Benchmark       | Processing Time (µs) | Throughput (MiB/s) |
|-----------------|----------------------|--------------------|
| whatlang/short  | 16.62                | 1.66               |
| whatlang/long   | 62.00                | 9.42               |
| whichlang/short | 0.26                 | 105.69             |
| whichlang/long  | 5.21                 | 112.31             |

### Accuracy

To generate the accuracy benchmark, we have changed the whatlang-accuracy-benchmark to add support for Whichlang. Given that Whatlang supports more languages, we have used its FilterList feature to restrict its analysis to only the languages that are supported in Whichlang. We also use the trigram method in Whatlang. Please check this repository to see our changes.

Crate: Whatlang

AVG: 91.69%

| LANG       | AVG    | <= 20   | 21-50  | 51-100 | > 100   |
|------------|--------|---------|--------|--------|---------|
| Arabic     | 99.68% | 99.51%  | 99.64% | 99.83% | 99.76%  |
| Mandarin   | 96.09% | 97.54%  | 96.92% | 95.45% | 94.43%  |
| German     | 88.57% | 70.00%  | 88.53% | 96.61% | 99.16%  |
| English    | 85.99% | 57.82%  | 88.37% | 97.97% | 99.78%  |
| French     | 90.88% | 72.84%  | 92.51% | 98.54% | 99.65%  |
| Hindi      | 99.80% | 100.00% | 99.83% | 99.78% | 99.61%  |
| Italian    | 87.75% | 66.67%  | 87.74% | 97.04% | 99.54%  |
| Japanese   | 94.37% | 93.97%  | 96.04% | 94.30% | 93.18%  |
| Korean     | 99.17% | 98.88%  | 99.69% | 99.44% | 98.66%  |
| Dutch      | 89.68% | 72.13%  | 89.78% | 97.40% | 99.40%  |
| Portuguese | 88.08% | 72.90%  | 85.76% | 95.22% | 98.44%  |
| Russian    | 99.98% | 100.00% | 99.96% | 99.98% | 100.00% |
| Spanish    | 82.91% | 55.45%  | 82.24% | 94.85% | 99.10%  |
| Swedish    | 84.16% | 58.33%  | 83.78% | 96.35% | 98.18%  |
| Turkish    | 86.73% | 61.01%  | 88.94% | 97.32% | 99.63%  |
| Vietnamese | 93.23% | 82.84%  | 92.96% | 97.88% | 99.24%  |
| AVG        | 91.69% | 78.74%  | 92.04% | 97.37% | 98.61%  |

Crate: Whichlang

AVG: 97.03%

| LANG       | AVG     | <= 20   | 21-50   | 51-100  | > 100   |
|------------|---------|---------|---------|---------|---------|
| Arabic     | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Mandarin   | 98.65%  | 98.69%  | 98.48%  | 98.55%  | 98.87%  |
| German     | 94.20%  | 80.00%  | 97.47%  | 99.49%  | 99.84%  |
| English    | 97.15%  | 91.84%  | 97.25%  | 99.57%  | 99.93%  |
| French     | 97.59%  | 93.83%  | 97.61%  | 99.20%  | 99.71%  |
| Hindi      | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Italian    | 97.20%  | 93.06%  | 97.33%  | 98.85%  | 99.57%  |
| Japanese   | 94.92%  | 88.95%  | 95.14%  | 97.74%  | 97.85%  |
| Korean     | 99.83%  | 99.44%  | 99.98%  | 99.97%  | 99.94%  |
| Dutch      | 97.08%  | 92.84%  | 96.98%  | 98.91%  | 99.60%  |
| Portuguese | 94.07%  | 83.87%  | 94.89%  | 98.18%  | 99.36%  |
| Russian    | 99.92%  | 99.69%  | 99.99%  | 100.00% | 100.00% |
| Spanish    | 92.12%  | 76.36%  | 93.78%  | 98.65%  | 99.70%  |
| Swedish    | 95.37%  | 90.28%  | 94.94%  | 97.76%  | 98.51%  |
| Turkish    | 95.51%  | 88.24%  | 98.11%  | 98.38%  | 97.33%  |
| Vietnamese | 98.79%  | 96.57%  | 98.87%  | 99.77%  | 99.96%  |
| AVG        | 97.03%  | 92.10%  | 97.55%  | 99.06%  | 99.39%  |

Contributors: Evance Soumaoro (@evanxg852000), Paul Masurel (@fulmicoton), Francois Massot (@fmassot)
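
To make the "How does it work?" description more concrete, here is a minimal, self-contained sketch of the hashing trick applied to two of the listed feature families (2, 3, 4-grams of ASCII letters and codepoint / 128), followed by a linear scoring step in the style of a multiclass logistic regression. This is not the code from the crate. The function names (`fnv1a`, `bucket`, `extract_features`, `predict`), the choice of FNV-1a as the hash, the decision to lowercase and to take n-grams within runs of letters, and the toy weights in `main` are all assumptions made for illustration. The third feature family (the projection of codepoints over a class) is not covered.

```rust
/// Size of the hashed feature space; the README states 4_096.
const DIMENSION: usize = 4_096;

/// FNV-1a, used only as a stand-in hash (assumption: the crate may hash differently).
/// `seed` keeps the different feature families from sharing the same hash stream.
fn fnv1a(bytes: &[u8], seed: u64) -> u64 {
    let mut hash = 0xcbf2_9ce4_8422_2325u64 ^ seed;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Hashing trick: map an arbitrary feature to one of DIMENSION buckets.
fn bucket(bytes: &[u8], seed: u64) -> usize {
    (fnv1a(bytes, seed) % (DIMENSION as u64)) as usize
}

/// Builds a vector of hashed feature counts for `text`.
fn extract_features(text: &str) -> Vec<f32> {
    let mut features = vec![0.0f32; DIMENSION];
    // 2, 3 and 4-grams over runs of ASCII letters (lowercasing is an assumption).
    let lower = text.to_ascii_lowercase();
    for word in lower.split(|c: char| !c.is_ascii_alphabetic()) {
        let bytes = word.as_bytes();
        for n in 2..=4usize {
            // `windows` yields nothing when the run is shorter than `n`.
            for gram in bytes.windows(n) {
                features[bucket(gram, n as u64)] += 1.0;
            }
        }
    }
    // codepoint / 128 for every non-ASCII character.
    for c in text.chars().filter(|c| !c.is_ascii()) {
        let block = (c as u32) / 128;
        features[bucket(&block.to_le_bytes(), 1)] += 1.0;
    }
    features
}

/// Linear scoring: returns the index of the class with the highest score.
/// Softmax is monotonic, so the argmax of the raw scores is the predicted class.
fn predict(features: &[f32], weights: &[Vec<f32>], intercepts: &[f32]) -> usize {
    let mut best = 0;
    let mut best_score = f32::NEG_INFINITY;
    for (class, (row, bias)) in weights.iter().zip(intercepts).enumerate() {
        let score = *bias + row.iter().zip(features).map(|(w, x)| w * x).sum::<f32>();
        if score > best_score {
            best = class;
            best_score = score;
        }
    }
    best
}

fn main() {
    // Toy "model" (NOT trained): each class weight row is the feature vector
    // of one prototype sentence, so scoring reduces to a dot product.
    let prototypes = ["the quick brown fox", "привет мир"];
    let weights: Vec<Vec<f32>> = prototypes.iter().map(|s| extract_features(s)).collect();
    let intercepts = vec![0.0; weights.len()];
    for text in ["the fox", "мир"] {
        let class = predict(&extract_features(text), &weights, &intercepts);
        println!("{:?} -> toy class index {}", text, class);
    }
}
```

Because every feature lands in a fixed-size vector, the model needs no vocabulary and inference reduces to a handful of hash computations plus one dot product per class, which is consistent with the throughput goals stated above. In the real library the weights come from the trained logistic regression rather than from prototype sentences.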