Path: news1.ucsd.edu!ihnp4.ucsd.edu!swrinde!newsfeed.internetmci.com!bloom-beacon.mit.edu!senator-bedfellow.mit.edu!faqserv
From: andrew@itl.atr.co.jp (Andrew Hunt)
Newsgroups: comp.speech,comp.answers,news.answers
Subject: comp.speech Frequently Asked Questions - part 1/3
Supersedes:
Followup-To: comp.speech
Date: 22 Dec 1995 14:10:43 GMT
Organization: ATR International, Japan
Lines: 1813
Approved: news-answers-request@MIT.Edu
Expires: 14 Feb 1996 14:10:32 GMT
Message-ID:
Reply-To: andrew@itl.atr.co.jp (Andrew Hunt)
NNTP-Posting-Host: bloom-picayune.mit.edu
Summary: Information on Speech Technology
X-Last-Updated: 1995/12/19
Originator: faqserv@bloom-picayune.MIT.EDU
Xref: news1.ucsd.edu comp.speech:6602 comp.answers:13224 news.answers:51624

Archive-name: comp-speech-faq/part1
Last-modified: 1995/12/19
URL: http://www.speech.su.oz.au/comp.speech/

COMP.SPEECH FAQ POSTING - PART 1/3

[Note: this document has been automatically extracted from a WWW site:
http://www.speech.su.oz.au/comp.speech
This may introduce some formatting errors.]

COMP.SPEECH FREQUENTLY ASKED QUESTIONS

The Frequently Asked Questions (FAQ) is a regular posting to comp.speech
which attempts to answer some of the regular questions in the comp.speech
newsgroup.

The FAQ is not meant to discuss any topic exhaustively. It will hopefully
provide readers with pointers on where to find useful information,
especially material available on the Internet.

If you have not already read the Usenet introductory material posted to
news.announce.newusers, please do. For help with FTP (file transfer
protocol) look for a regular posting of anonymous FTP FAQ in comp.misc,
comp.archives.admin or news.answers.

This FAQ is posted every 4 weeks to comp.speech, comp.answers and
news.answers.
It is also available for ftp from the comp.speech archive site:

* ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/FAQ-complete

Or from the news.answers ftp site (and its mirrors):

* ftp://rtfm.mit.edu/pub/usenet/comp.speech/*

Or on the World Wide Web:

* Australia: http://www.speech.su.oz.au/comp.speech/
* Britain: http://svr-www.eng.cam.ac.uk/comp.speech/
* Japan: http://www.itl.atr.co.jp/comp.speech/

Or by sending email to mail-server@rtfm.mit.edu with the following line in
the body of the message:

* send usenet/news.answers/comp-speech-faq/*

Finally, if you only have email access to the internet, then I suggest you
obtain the Internet-by-email guide. Send email to mail-server@rtfm.mit.edu
with the following line in the body of the message:

* send usenet/news.answers/internet-services/access-via-email

Admin

About 20 of the 190 WWW pages for the FAQ have been updated in the last
month. Thanks to the many people who sent in information and new entries.
Nothing else to report.

Acknowledgements

Hundreds of people have made contributions to the comp.speech FAQ over the
last three years; there are too many to name individually. Special thanks
go to Tony Robinson and Joe Campbell who have been particularly helpful.

I am grateful to the people at Sydney University, Cambridge University and
ATR ITL for supporting the FAQ on their WWW sites.

Disclaimer

The comp.speech WWW pages are provided as is without any express or implied
warranties. While every effort has been taken to ensure the accuracy of the
information contained in this article, the author assumes no responsibility
for errors or omissions, or for damages resulting from the use of the
information contained herein.

Copyright and Reproduction

Copyright (c) 1995 by Andrew Hunt, all rights reserved.

The comp.speech WWW pages may not be distributed for financial gain. The
comp.speech WWW pages may not be included in any collections or
compilations without express permission from the author.
Hyperlinks to the comp.speech WWW pages are encouraged.

Maintainer

The FAQ posting and the Comp.Speech WWW Site are maintained by

    Andrew Hunt
    ATR Interpreting Telecommunications Research Laboratories
    Hikari-dai 2-2, Seika-cho, Kyoto 619-02, Japan
    andrew@itl.atr.co.jp

___________________________________________________________________________

TABLE OF CONTENTS

FAQ SECTION 1: GENERAL INFORMATION ON SPEECH TECHNOLOGY

* Q1.1: What is comp.speech?
* Q1.2: comp.speech ftp site
* Q1.3: Common abbreviations and jargon
* Q1.4: Related newsgroups and mailing lists
* Q1.5: Related journals and conferences
* Q1.6: Handicap Aids
* Q1.7: Speech Databases
* Q1.8: Speech File Formats and Conversion
* Q1.9: Speech Laboratory Environments and Audio Editors
* Q1.10: Speech Research Sites
* Q1.11: Miscellaneous Software and Resources

FAQ SECTION 2: SIGNAL PROCESSING

* Q2.1: What sampling do I need for speech?
* Q2.2: Finding the pitch of a speech signal
* Q2.3: How do I find the start and end points of a speech signal?
* Q2.4: Where can I find FFT software?
* Q2.5: Signal processing in speech technology
* Q2.6: Speech sampling and signal processing hardware
* Q2.7: How do I convert to/from mu-law format?

FAQ SECTION 3: SPEECH CODING AND COMPRESSION

* Q3.1: Speech compression techniques
* Q3.2: References on coding/compression
* Q3.3: Compression and Coding Software

FAQ SECTION 4: NATURAL LANGUAGE PROCESSING

* Q4.1: NLP References and Books
* Q4.2: NLP Software

FAQ SECTION 5: SPEECH SYNTHESIS

* Q5.1: What is speech synthesis?
* Q5.2: How can speech synthesis be performed?
* Q5.3: References/Books on Synthesis
* Q5.4: Speech Synthesis on the WWW
* Q5.5: Speech Synthesis Software/Hardware

FAQ SECTION 6: SPEECH RECOGNITION

* Q6.1: What is speech recognition?
* Q6.2: How is speech recognition performed?
* Q6.3: How can I build a simple speech recogniser?
* Q6.4: References & books on speech recognition
* Q6.5: Speech Recognition Hardware/Software

___________________________________________________________________________

LIST OF SOFTWARE/HARDWARE/INFORMATION

The comp.speech FAQ provides information on a range of software, hardware
and resources.

Q1.7: Speech Data

* Bavarian Archive for Speech Signals
* BUPT Spoken Digit Database (Chinese)
* Center for Spoken Language Understanding (CSLU)
* Examples of IPA Symbols
* Linguistic Data Consortium (LDC)
* NOISEX
* Oxford Acoustic Phonetic Database
* Phonemic Samples
* RELATOR project

Q1.9: Speech Processing Environments

* CSRE: Canadian Speech Research Environment
* Entropic Signal Processing System (ESPS) and Waves
* GoldWave
* Kay Elemetrics Computer Speech Lab
* Khoros
* Matlab plus Signal Processing Toolbox
* MacSpeech Lab II
* N!Power
* OGI Speech Tools
* Ptolemy
* Signalyze 3.0
* SoundScope

Q1.11: Miscellaneous Software and Resources

NETWORK "PHONE" SOFTWARE

* CyberPhone
* FAQ: How can I use the Internet as a telephone?
* NetPhone from Electric Magic Company
* NEVOT (1.4v) from AT&T BL
* Internet Phone from VocalTec

AUDIO PROCESSING SOFTWARE

* AF version AF3R1
* MixViews
* Network Audio System Release 1.1
* NIST Software - SPHERE and SCORE
* Sound Processing Kit

HUMAN AUDIO PERCEPTION

* Auditory Modeller 1
* Auditory Modeller 2
* Auditory Toolbox for Matlab
* Human Audio Perception Document

DICTIONARIES AND OTHER LEXICAL TOOLS

* BEEP dictionary
* CMU dictionary
* CUVOALD dictionary
* Dictionary
* Homophone List
* MRC database
* Dictionaries on the WWW

PHONETIC FONTS

* Summer Institute of Linguistics IPA Fonts
* Yamada Language Center

Q2.6: Audio Hardware

* Macintosh Audio Hardware
* PC Audio Hardware
* Unix Audio Hardware

Q3.3: Compression Software and Hardware

* 32 kbps ADPCM
* CELP 3.2a & LPC
* 8 Kbit/s CELP on the TMS320C5x family of DSP chips
* File format conversion
* G.711/721/723 Compression
* G.728 LD-CELP vocoder
* G.728 Compression
* GSM 06.10 Compression
* Lernout & Hauspie Speech Coding (5 products)
* Lernout & Hauspie Speech Coding SDK
* shorten - a lossless compressor for speech signals
* TrueSpeech from DSP Group
* U.S.F.S.
  1016 CELP vocoder for DSP56001
* ToolVox from Voxware

Q4.2: Natural Language Processing

* Natural Language Software Registry (NLSR) - NLP Tools
* Part of Speech Tagger

Q5.5: Speech Synthesis

* AsTeR
* TheBigMouth
* CSRE: Canadian Speech Research Environment
* DECTalk
* Eloquence
* Emacspeak - A Speech Output Subsystem For Emacs
* Infovox Product Range
* JSRU
* Klatt-style synthesiser
* KPE80 - A Klatt Synthesiser and Parameter Editor
* "learph": Trainable text-to-phoneme software by Antonio Lucca
* Lernout and Hauspie Text-To-Speech (3 products)
* Lernout and Hauspie Text-To-Speech Windows SDK
* Various Mac Speech Output Applications
* MacinTalk
* Monologue for Windows from First Byte
* Narrator Translator Library
* Narrator
* TextToSpeech Kit (NeXT)
* Orator from Bellcore
* PAM - A Text-To-Speech Application
* ProVerbe Speech Engine for Windows
* ProVoice Developer's Speech Toolkit from First Byte
* RC Systems V8600/V8601 Text to Speech synthesizers
* rsynth
* SENSYN speech synthesizer
* SGI Developers Toolbox Synthesiser
* SIMTEL
* Sound Bytes Developer's Kit
* spchsyn.exe
* Speak
* Speech Manager and PlainTalk
* Text to Phoneme Program 1
* Text to phoneme program 2
* Text to phoneme program 3
* Tinytalk
* TrueTalk
* TruVoice from Centigram

Q6.5: Speech Recognition

* AbbotDemo
* BBN Hark Telephony Recognizer
* Corona Speech Recognition System
* Custom Voice(TM) by A&G Graphics Interface
* D6006 Voice Control Processor
* DATAVOX - French
* Digital Dreams Speech Recognition Plug-Ins
* DragonDictate version 3.0
* DragonDictate for Windows
* DragonVoiceTools
* DSP Semiconductor Recognition Chip
* EARS: Single Word Recognition Package
* HM2007 - Speech Recognition Chip
* Hidden Markov Model Toolkit (HTK) from Entropic
* IBM VoiceType Dictation
* ICSS system from IBM
* IN3 Voice Command
* IN3 Voice Command for Windows
* Kurzweil Voice for Windows
* Lernout & Hauspie ASR (3 products)
* Lernout & Hauspie ASR SDK
* Listen for Windows 2.0 - Verbex Voice Systems
* Lotec
  Speech Recognition Package
* Myers' Hidden Markov Model software
* NCC Dictate
* OKI VRP6679 - Speech Recognition Chip
* Speech Systems Phonetic Engine 500 (PE500)
* PowerSecretary
* ProNotes Voice Tools (due late '95)
* PureSpeech
* recnet
* SayIt
* Simon Says - for NeXT
* Speech Commander - Verbex Voice Systems
* 'Speech Recognition Expert' Toolkit for Windows
* Visual Voice from Stylus Innovation
* Voice Command Line Interface
* Voice Control Systems Recognition
* Visus SpeechKit
* VCS 2030 & 2060 Voice Dialer
* Voice-Trek 2.0
* Creative VoiceAssist
* Voice Blaster Ver. 4.0
* VoiceServer for Windows
* Votan
* Voice Processing Corporation Speech Recognition Product Line

___________________________________________________________________________

FAQ SECTION 1 - GENERAL

* Q1.1: What is comp.speech?
* Q1.2: comp.speech ftp site
* Q1.3: Common abbreviations and jargon
* Q1.4: Related newsgroups and mailing lists
* Q1.5: Related journals and conferences
* Q1.6: Handicap Aids
* Q1.7: Speech Databases
* Q1.8: Speech File Formats and Conversion
* Q1.9: Speech Laboratory Environments and Audio Editors
* Q1.10: Speech Research Sites
* Q1.11: Miscellaneous Software and Resources

Q1.1: WHAT IS COMP.SPEECH?

Comp.speech is an unmoderated newsgroup for discussion of speech technology
and speech science. It covers a wide range of issues from the application
of speech technology, to research, to products and lots more. By its
nature, speech technology is an inter-disciplinary field and the newsgroup
reflects this. However, computer application is the basic theme of the
group.

Note: If you don't know what a newsgroup is, then talk to your local system
administrator about how to get access. A useful newsgroup for beginners is
news.announce.newusers. You might also find the following documents useful.

ftp://rtfm.mit.edu/pub/usenet/news.announce.newusers/What_is_Usenet?
ftp://rtfm.mit.edu/pub/usenet/news.announce.newusers/Answers_to_Frequently_Asked_Questions_about_Usenet
ftp://rtfm.mit.edu/pub/usenet/news.announce.newusers/Rules_for_posting_to_Usenet
ftp://rtfm.mit.edu/pub/usenet/news.announce.newusers/FAQs_about_FAQs

The following is a list of some of the topics covered by comp.speech.

* Speech Recognition - discussion of methodologies, training, techniques,
  results and applications. This should cover the application of techniques
  including HMMs, neural-nets and so on to the field.
* Speech Synthesis - discussion concerning theoretical and practical issues
  associated with the design of speech synthesis systems.
* Speech Coding and Compression - both research and application matters.
* Phonetic/Linguistic Issues - coverage of linguistic and phonetic issues
  which are relevant to speech technology applications. Could cover
  parsing, natural language processing, phonology and prosodic work.
* Speech System Design - issues relating to the application of speech
  technology to real-world problems. Includes the design of user
  interfaces, the building of real-time systems and so on.
* Other matters - relevant conferences, jobs, books, software, hardware,
  and products.

Q1.2: COMP.SPEECH FTP SITE

Tony Robinson maintains the comp.speech ftp site. The ftp site is a
comprehensive repository of software and information related to speech
technology. The site is

* ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/

COMP.SPEECH ARCHIVES

The comp.speech ftp site provides full archives of the comp.speech
newsgroup dating back to the creation of the group in 1991. The postings
are stored in the order in which they arrive. Each batch of 1000 articles
is grouped into a gzip'ed tar file. Matching files listing the subjects are
also provided.

* ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/archive/

SOFTWARE AND OTHER RESOURCES

The comp.speech ftp site includes a wide range of useful software and
resources.
Tony has arranged it into a series of sub-directories:

/analysis : Speech analysis software
    FFT code, a pitch tracker, RASTA code, and IEEE DSP code.
/auditory : Auditory model software
    AIM, Auditory Toolbox and Lutear.
/coding : Speech coding software
    ADPCM, CELP 3.2a, G711, G721, G723, GSM, LDCELP, LPC10, Shorten.
/data : Repository for (small) speech-related databases
    BEEP, CMUDict, Homophone list, hVd database, Peterson Barney database
/dictionaries : Phonetic dictionaries
    BEEP, CMUDict, CUVOALD, Homophone list, MRC database
/info : Key postings to comp.speech archives by subject
    Lots of interesting info!
/recognition : Speech recognition software
    AbbotDemo, Ears, Lotec, recnet, sound blaster recognition, whistle
/simtel_sound : Mirror of the simtel/msdos/sound directory
    Range of useful software
/simtel_voice : Mirror of the simtel/msdos/voice directory
    Another range of useful software
/synthesis : Speech synthesis software
    Klatt synthesis software, Klatt parameter editor and rsynth.
/tools : Miscellaneous tools
    Part-of-speech tagger, OGI speech tools, sox audio file format
    conversion, SPHERE software and more.

Q1.3: COMMON ABBREVIATIONS AND JARGON.

* ANN - Artificial Neural Network.
* ASR - Automatic Speech Recognition.
* ASSP - Acoustics Speech and Signal Processing
* AVIOS - American Voice I/O Society
* CELP - Code-book Excited Linear Prediction.
* COLING - COmputational LINGuistics
* DTW - Dynamic Time Warping.
* FAQ - Frequently Asked Questions.
* HMM - Hidden Markov Model.
* IEEE - Institute of Electrical and Electronics Engineers
* JASA - Journal of the Acoustical Society of America
* LPC - Linear Predictive Coding.
* LVQ - Learning Vector Quantisation.
* NLP - Natural Language Processing.
* NN - Neural Network.
* TI - Texas Instruments.
* TIMIT - A large speech corpus from TI and MIT - see Q1.7
* TTS - Text-To-Speech (i.e. synthesis).
* VQ - Vector Quantisation.

Q1.4: RELATED NEWSGROUPS AND MAILING LISTS.
Newsgroups

comp.ai - Artificial Intelligence newsgroup.
    Postings on general AI issues, language processing and AI techniques.
    The comp.ai FAQ covers NLP, NN and other AI information.
comp.ai.nat-lang - Natural Language Processing Group
    Postings regarding Natural Language Processing. Set up to cover a broad
    range of related issues and different viewpoints. A comp.ai.nat-lang
    FAQ posting is available.
comp.ai.nlang-know-rep - Natural Language Knowledge Representation
    Moderated group.
comp.ai.neural-nets - discussion of Neural Networks and related issues.
    There are often postings on speech related matters - phonetic
    recognition, connectionist grammars and so on. A comp.ai.neural-nets
    FAQ posting is available.
comp.compression - occasional articles on compression of speech.
    The comp.compression FAQ has some info on audio compression standards.
comp.dcom.telecom - Telecommunications newsgroup.
    Has occasional articles on voice products.
comp.dsp - discussion of signal processing - hardware and algorithms and
    more.
    Has a good FAQ posting which is also available on the WWW and by ftp
    (addresses below). Has a regular posting of a comprehensive list of
    Audio File Formats.
    + http://www.bdti.com/dsp_faq.htm
    + ftp://rtfm.mit.edu/pub/usenet/comp.dsp/
comp.multimedia - Multi-Media discussion group.
    Has occasional articles on voice I/O.
sci.lang - Language.
    Discussion about phonetics, phonology, grammar, etymology and lots
    more. A sci.lang FAQ is available.
alt.sci.physics.acoustics
    Some discussion of speech production & perception.
alt.binaries.sounds.* - posting and discussion of sound samples.

Mailing Lists

[There are many other mailing lists which are not mentioned here. If you
know of one which should be included in the list, then please submit it.]

ECTL - Electronic Communal Temporal Lobe
    Founder & Moderator: David Leip. Moderated mailing list for researchers
    with interests in computer speech interfaces.
    This list serves a broad community including persons from signal
    processing, AI, linguistics and human factors. To subscribe, send your
    name, institute, department, daytime phone and email address to:
    + ectl-request@snowhite.cis.uoguelph.ca
    The ECTL archive site is ftp://snowhite.cis.uoguelph.ca/pub/ectl
Prosody Mailing List
    Unmoderated mailing list for discussion of prosody. The aim is to
    facilitate the spread of information relating to the research of
    prosody by creating a network of researchers in the field. If you want
    to participate, send the following one-line message to
    + listserv@msu.edu
    + subscribe prosody Your Name
foNETiks
    A moderated monthly newsletter distributed by e-mail. It carries job
    advertisements, notices of conferences, and other news of general
    interest to phoneticians, speech scientists and others. The editors are
    Linda Shockey and Gerry Docherty. To subscribe send the following
    one-line message to
    + mailbase@mailbase.ac.uk
    + join fonetiks your_first_name your_second_name
Digital Mobile Radio
    Covers many areas, including some speech topics such as speech coding
    and speech compression. Mail Peter Decker dec@dfv.rwth-aachen.de to
    subscribe.

Q1.5: RELATED JOURNALS AND CONFERENCES

[Note: Also see the list provided in Shikano's WWW site on Speech and
Acoustics:
http://www.aist-nara.ac.jp/IS/Shikano-lab/database/internet-resource/e-www-site.html.]

Product Oriented Magazines

* Voice News - monthly industry newsletter
  Stoneridge Technical Services
  PO Box 1891, Rockville, MD, 20850, USA
  Phone: (301) 424-0114
* Voice Technology News
* Voice Processing Magazine (1-800-854-3112)
* Speech Technology (no longer published)

Technical Journals

(There are some contact addresses below.)
* Computer Speech and Language
* Speech Communication
* IEEE Transactions on Speech and Audio Processing
* IEEE Signal Processing Magazine
* IEEE Transactions on Acoustics, Speech, and Signal Processing (ASSP)
  (now obsolete)
* Computational Linguistics (COLING)
* Journal of the Acoustical Society of America (JASA)
* AVIOS Journal
* ASR News

Conferences

* ICASSP: Intl. Conference on Acoustics Speech and Signal Processing (IEEE)
* ICSLP: Intl. Conference on Spoken Language Processing
* EUROSPEECH: European Conference on Speech Communication and Technology
* AVIOS: American Voice I/O Society Conference
* SST: Australian Speech Science and Technology Conference

Some Contact Addresses

Institute of Electrical and Electronics Engineers (IEEE)
    For IEEE Transactions on Speech and Audio Processing (from Jan 93) and
    IEEE Transactions on Acoustics, Speech, and Signal Processing (ASSP) -
    now obsolete.
    IEEE Service Center
    445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855, USA
    Phone: 1-800-678-IEEE or (201)981-0060
Harcourt Brace and Company Ltd.
    For Computer Speech and Language
    Price: $US170 (Institutions), $US75 (Individuals), 4 times per year.
    High Street, Foots Cray, Sidcup
    Kent, DA14 5HP, England
Association for Computational Linguistics
    For Computational Linguistics
    MIT Press Journals
    55 Hayward St, Cambridge, MA 02142, USA
    Phone: (617)253-2889

Q1.6: HANDICAP AIDS

Can anyone provide information on speech technology aids for the deaf,
blind, speech impaired, physically impaired or others who may benefit from
speech technology?

SpeechViewer II

* Platform: IBM Machines from Mod 25 on.
* Description: SpeechViewer II is a speech therapy tool. It provides
  graphical feedback of various speech features so that speech impaired
  individuals can improve their speech. It works with an audio bandwidth of
  7.3 kHz and thus allows the therapist to work with sustained vowels and
  fricatives. A wide range of graphics are used to provide adequate
  variability to hold client interest.
  An extensive set of statistics is gathered, which allows a therapist to
  do research or keep therapy records. The speech therapy modules are:
  + Awareness - Sound, Loudness, Pitch, Voicing Onset, Voicing
  + Skill Building - Pitch, Voicing, Phonology
  + Patterning - Pitch & Loudness - Waveform & Spectrogram, Spectra
  + Clinical Management - Profiles, Models, Client Data
* Hardware: Requires an IBM M-ACPA (Multimedia-Audio Capture Playback
  Adapter). It has a TI TMS320C25 DSP chip. The input sampling rate is
  44.1 kHz stereo, 88.2 kHz mono. This is a 16 bit card. It has the
  following jacks: mic in, stereo line in, stereo line out, speaker out.
  Note: This card is being replaced by Mwave technology. For more info on
  Mwave contact Texas Instruments.
* Price:
  + The software is $2130 list, $1491 educational, part number 92F2066.
  + The M-ACPA is $370 list, $222 educational, part number 92F3378.
  + The MicroChannel adapter part number is 92F3379 (same price).
* Contact: The Psychological Corporation (TPC) [IBM Authorized Remarketer]
  Phone: 1-800-228-0752 or contact IBM on 1-800-426-4832.

Q1.7: SPEECH DATABASES

A wide range of speech databases have been collected. These databases are
primarily for the development of speech synthesis/recognition and for
linguistic research.

Some databases are free but most are not. The databases normally require
lots of storage space (100's of MBytes is not unusual). Do not expect to be
able to ftp large amounts of speech data.

In addition to the descriptions of speech databases and speech database
providers below, information can be obtained from

LDC: Linguistic Data Consortium
    Provides a very wide range of speech and text data to research and
    commercial users: see below.
COCOSDA
    Home Page: http://www.itl.atr.co.jp/cocosda/
    The International Committee for the Co-ordination and Standardisation
    of Speech Databases and Assessment Techniques for Speech Input/Output.
Shikano's WWW site on Speech and Acoustics
    http://www.aist-nara.ac.jp/IS/Shikano-lab/database/internet-resource/e-www-site.html
RELATOR Project
    European resource initiative: see below.

The following speech data resources are described in the FAQ.

* Bavarian Archive for Speech Signals
* BUPT Spoken Digit Database (Chinese)
* Center for Spoken Language Understanding (CSLU)
* Examples of IPA Symbols
* Linguistic Data Consortium (LDC)
* NOISEX
* Oxford Acoustic Phonetic Database
* Phonemic Samples
* RELATOR project

Bavarian Archive for Speech Signals

* Description: The Bavarian Archive for Speech Signals (BAS) was founded in
  January 1995 as an initiative of the Institute of Phonetics at the
  University of Munich, Germany. The BAS will develop, validate,
  administrate and disseminate corpora of spoken German to the speech
  community as well as to speech engineering industry. Presently the
  following German speech corpora are available on ISO 9660 CDROM:

  Siemens 1000 - SI1000
      5 CDROMs, newspaper corpus, read speech, 10 speakers x 1000
      utterances
  Siemens 100 - SI100
      7 CDROMs, read speech, 101 speakers x 100 sentences
  PhonDat 1 - PD1
      6 CDROMs, new edition in preparation, read speech, 201 speakers x
      450+ sentences
  PhonDat 2 - PD2
      1 CDROM, read speech, 2nd edition, 16 speakers x 200 sentences,
      various labelled information
  Verbmobil
      Spontaneous speech recorded in a dialog task (appointment
      scheduling). More information on the VERBMOBIL project:
      http://www.dfki.uni-sb.de/verbmobil

  Corpora in Preparation

  PhonDat I - PD1: 2nd extended edition (Jul 1995)
  Strange Corpora - SC
      Reference Corpora that reflect certain well known problems in speech
      processing, like accents, repair, breaks, hesitations, repetitions,
      extreme F0, background noise, pathological speech, speaker
      adaptation. The first SC corpus (SC1 Accents) will be edited in
      Jul 1995.
  BAS Edition of Verbmobil Corpora - VM: 2nd extended edition
  Articulatory data - AD: EMA data of speakers of SI1000 corpus
  ERBA: 10000 utterances from a train inquiry task
* Misc: BAS is currently developing tools for the automatic annotation and
  segmentation of very large speech corpora. This includes the automatic
  detection of variants of pronunciation, a statistical based alignment and
  a rule-based refinement of the outcome. The BAS seeks to cooperate with
  public institutions as well as with industrial partners to further
  develop new German speech databases. BAS can be a platform to
  re-distribute existing German speech.
* Contact and More Information: The BAS is located at the University of
  Munich, Germany.
      BAS c/o Institut fuer Phonetik
      Schellingstr. 3/II
      80799 Muenchen
      Germany
      Ph: +49-89-21802758
      Fax: +49-89-2800362
      email: bas@sun1.phonetik.uni-muenchen.de
      WWW: http://www.phonetik.uni-muenchen.de/BASSeng.html

BUPT Spoken Digit Database (Chinese)

* Vocabulary : {0, 1/yi/, 2, 3, 4, 5, 6, 7, 8, 9, 1/yao/, /dui/, /cuo/ },
  13 words in total.
* Size: 1202 speakers in total, 789 Males and 413 Females. Each speaker
  utters each word 2 times. Total of 31252 utterances.
* Format: 8000Hz 14bit sampling. One utterance per file.
* Contact: GLuck Co.
      195 Berlioz 1C, Nun's Island
      Verdun H3E 1C1, Canada
      e-mail: weigang@zaphod.math.mcgill.ca

Center for Spoken Language Understanding (CSLU)

* The ISOLET speech database of spoken letters of the English alphabet. The
  speech is high quality (16 kHz with a noise cancelling microphone). 150
  speakers x 26 letters of the English alphabet twice in random order. The
  ISOLET data base can be purchased for $100 by sending an email request to
  vincew@cse.ogi.edu. (This covers handling, shipping and medium costs).
  The data base comes with a technical report describing the data.
* CSLU has a telephone speech corpus of 1000 English alphabets. Callers
  recite the alphabet with brief pauses between letters.
  This database is available to not-for-profit institutions for $100. The
  data base is described in the proceedings of the International Conference
  on Spoken Language Processing.
  + Contact vincew@cse.ogi.edu if interested.
* CSLU has released for universities its Continuous English Speech Corpus.
  The corpus contains recorded speech from 690 different speakers, with
  label files at various levels - including word level and phonetic labels.
  The data were collected as part of the OGI Multi-language telephone
  corpus. CSLU provides speech corpora to all universities without charge.
  To order a corpus, print the license agreement/order form, complete it,
  and fax it to the CSLU. A description of the corpora and an order form
  are available by anonymous ftp: ftp://speech.cse.ogi.edu/pub/releases
* Contact: Mike Noel
      email: noel@cse.ogi.edu
      Phone: (503) 690-1309

Examples of IPA Symbols

UCLA SOUNDS OF THE WORLD'S LANGUAGES

* Description: The UCLA Sounds of the World's Languages are available for
  Macintosh users (no DOS based system currently available). The sounds are
  stored in a Hypercard database developed at the UCLA Phonetics
  Laboratory. The aim is to illustrate and teach about the range of sounds
  used in human languages with material on more than 80 languages. The set
  demonstrates particular highlights of the sound systems focusing
  especially on rarer sounds that students may not otherwise have a chance
  to hear from a native speaker. The recordings are based on the archives
  of recordings collected at UCLA, with additional contributions from
  outside collaborators. All the languages can be accessed from the list of
  language names, or by clicking on the language name in a set of maps.
  Support for part of this work was provided by NSF. The database currently
  includes examples of languages from Agul and Akan to Zulu.
* Availability: 15 DSDD disks, requiring about 35 meg of disk space when
  expanded. Available for $50 individual $100 institutions.
  Prepayment in US dollars (checks or international money orders payable to
  "UC Regents") must accompany all orders.
* Contact: The UCLA Phonetics Laboratory
      Linguistics Department, UCLA, Los Angeles, CA 90095-1543
      Tel: (310) 825-1254
      E-mail: oldfogey@ucla.edu

JOHN ESLING'S "IPA LABELS"

* Description: A HyperCard stack which is available for free or a nominal
  fee.
* Contact: John Esling can be reached by email: pdb@uvvm.uvic.ca.

Linguistic Data Consortium (LDC)

The LDC was established to broaden the collection and distribution of
speech and natural language data bases for the purposes of research and
technology development in automatic speech recognition, natural language
processing and other areas where large amounts of linguistic data are
needed.

Detailed information on the LDC is now available on the WWW:
http://www.cis.upenn.edu/~ldc/home.html. The LDC WWW server provides
information on membership agreements, license agreements, and summaries of
speech and text corpora available.

Speech Corpora

* TIMIT Acoustic-Phonetic Continuous Speech Corpora and NYNEX Telephone
  Version of TIMIT Corpus (NTIMIT)
* Resource Management Corpora
* Air Travel Information System (ATIS) Corpora (multiple)
* ARPA Continuous Speech Recognition Corpora (WSJ etc)
* Switchboard Corpus of Recorded Telephone Conversations and Switchboard
  Corpus Excerpts (Credit Card Conversations)
* Texas Instruments 46-Word Speaker-Dependent Isolated Word Corpus (TI46)
* Texas Instruments Speaker-Independent Connected-Digit Corpus (TIDIGITS)
* Road Rally Conversational Speech Corpus
* HCRC Map Task Corpus
* Air Traffic Control Corpus (ATC0)
* SPIDRE Speaker Identification Corpus
* YOHO Speaker Verification Corpus
* OGI Multi-Language Corpus and OGI Spelled and Spoken Telephone Corpus
* BRAMSHILL
* MACROPHONE
* King Corpus for Speaker Verification Research
* WSJCAM0: Cambridge Read News Corpus
* TRAINS Spoken dialog corpus
* NYNEX PhoneBook Database
* Frontiers in Speech Processing

Text Corpora

* Association
  for Computational Linguistics Data Collection Initiative (ACL/DCI)
* The Penn Treebank Project - Release 2
* TIPSTER Information Retrieval Text Research Collection
* United Nations Parallel Text Corpus (English, French, Spanish)
* Japanese Language Financial New
* European Corpus Initiative-1

Lexical Databases

* CELEX Lexical Database
* COMLEX : COMmon LEXical Database of English (English syntax and
  pronunciation)

For more information:

Contact: Linguistic Data Consortium
    441 Williams Hall, University of Pennsylvania,
    Philadelphia, PA 19104-6305, USA
    Phone: +1 (215) 898-0464
    Fax: +1 (215) 573-2175
    e-mail: ldc@unagi.cis.upenn.edu
    WWW: http://www.cis.upenn.edu/~ldc/home.html
    Anonymous ftp: ftp://ftp.cis.upenn.edu/pub/ldc/

NOISEX-92

* Description: Database of recordings of various noises, available on 2
  CDROMs. Some material from the same source is available by anonymous ftp
  in the IEEE's Signal Processing Information Base. The samples include
  + Voice babble
  + Factory noise
  + HF radio channel noise, pink noise, white noise
  + Various military noises; fighter jets (Buccaneer, F16), destroyer
    noises (engine room, operations room), tank noise (Leopard, M109),
    machine gun
  + Volvo 340
* Availability 1: The cost of this database is 135 Pounds Sterling for the
  set of two CD-ROMs. Send payment with order to:
      The Speech Research Unit, Ex1, DRA Malvern,
      St.Andrew's Road, Malvern, Worcestershire, WR14 3PS, UK
      Tel +44-684-894074
      Fax +44-684-894384
  Note: The supply of CD-ROMs is limited so please check that they are
  still available before placing an order. The only acceptable methods of
  payment are cheques (from the UK only) or bank drafts in Pounds Sterling
  drawn on a UK bank. They should be made payable to:- Public Sub Account
  HMG 4768.
* Availability 2: Information on how to obtain a copy of the NATO RSG.10
  NOISE-ROM-0 can be obtained from the DRA Speech Research Unit (address
  above) or from: Dr. Herman Steeneken, TNO Institute for Perception, P.O.
Box 23, 3769 ZG Soesterberg, The Netherlands.
* Examples: The IEEE samples of the NOISEX database are available by anonymous ftp (the data files average around 10MB). ftp://bellona.cs.rice.edu/spib/data/noise/

Oxford Acoustic Phonetic Database
* Available on compact disc, from J. Pickering and B. Rosner. It contains data on vowel-consonant and consonant-vowel combinations in both stressed and unstressed locations. The languages covered include French, German, Hungarian, Italian, Japanese, British English, Spanish and English. For further information write to Electronic Publishing, Oxford University Press, Walton Street, Oxford OX2 6DP, UK. The ISBN is 0-19-268086-2.
* Contact: Prof. B. Rosner, Dept. of Experimental Psychology, South Parks Rd, Oxford, OX1 3UD, UK email: burton.rosner@wolfson.ox.ac.uk

Phonemic Samples
* Some basic data. The following ftp sites have samples of English phonemes (American accent, I believe) in Sun audio format files. See Question 1.8 for information on audio file formats.
  ftp://sounds.sdsu.edu/.1/phonemes: This ftp site appears to be obsolete. Does anyone know a new address?
  ftp://phloem.uoregon.edu/pub/Sun4/lib/phonemes: There appears to be some config problem with this ftp server.
  ftp://sunsite.unc.edu/pub/multimedia/sun-sounds/phonemes

The RELATOR project
* Description: RELATOR is a European-wide consortium of researchers who, with the support of the European Commission, are striving to establish a European repository of linguistic resources. Linguistic resources comprise a variety of spoken and written language materials, including lexicons, grammars, corpora, and spoken language databases. RELATOR will ensure that the requirements of the European language processing community receive attention. The RELATOR WWW pages provide information on the consortium. The languages currently covered by the RELATOR consortium include Danish, Dutch, English, French, German, Greek, Italian, Portuguese, Spanish plus multilingual resources.
The resources include both text and speech. * WWW: http://cristal.icp.grenet.fr/Relator/homepage.html Q1.8: SPEECH FILE FORMATS AND CONVERSION Q2.7 of this FAQ has information on mu-law coding. A very good and very comprehensive list of audio file formats is prepared by Guido van Rossum. The list is posted regularly to comp.dsp and alt.binaries.sounds.misc, amongst others. It includes information on sampling rates, hardware, compression techniques, file format definitions, format conversion, standards, programming hints and lots more. It is also available by ftp from ftp://ftp.cwi.nl/pub/audio/AudioFormats.part1,2 Q1.9: SPEECH LABORATORY ENVIRONMENTS AND AUDIO EDITORS First, what is a Speech Laboratory Environment? A speech lab is a software package which provides the capability of recording, playing, analysing, processing, displaying and storing speech. Your computer will require audio input/output capability. The different packages vary greatly in features and capability - best to know what you want before you start looking around. Most general purpose audio editing packages will be able to process speech but do not necessarily have some specialised capabilities for speech (e.g. formant analysis). The following article provides a good survey. * Read, C., Buder, E., & Kent, R. "Speech Analysis Systems: An Evaluation" Journal of Speech and Hearing Research, pp 314-332, April 1992. The following is a list of the speech labs described in the FAQ. * CSRE: Canadian Speech Research Environment * Entropic Signal Processing System (ESPS) and Waves * GoldWave * Kay Elemetrics Computer Speech Lab * Khoros * Matlab plus Signal Processing Toolbox * MacSpeech Lab II * N!Power * OGI Speech Tools * Ptolemy * Signalyze 3.0 * SoundScope CSRE: Canadian Speech Research Environment * Platform: IBM/AT-compatibles * Description: CSRE is a microcomputer-based system designed to support speech research. 
CSRE provides a low-cost facility in support of speech research, using mass-produced and widely-available hardware. The project is non-profit, and relies on the cooperation of researchers at a number of institutions and fees generated when the software is distributed. Functions include speech capture, editing, and replay; several alternative spectral analysis procedures, with color and surface/3D displays; parameter extraction/tracking and tools to automate measurement and support data logging; alternative pitch-extraction systems; parametric speech (KLATT80) and non-speech acoustic synthesis, with a variety of supporting productivity tools; and an experiment generator, to support behavioral testing using a variety of common testing protocols. A paper about the whole package can be found in:
  + Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc. of the Second Intl. Conf. on Spoken Language Processing, Edmonton: University of Alberta, pp. 1127-1130.
* Hardware: Can use a range of data acquisition/DSP hardware.
* Cost: Distributed on a cost recovery basis.
* Availability: For more information on availability contact AVAAZ Innovations Inc., P.O. Box 8040, 1225 Wonderland Rd. N, London, Ontario, CANADA, N6G 2B0 Tel: (519) 472-7944 Fax: (519) 472-7814 Email: info@avaaz.com
* Note: Also included in Q5.5 on speech synthesis packages.

Entropic Signal Processing System (ESPS) and Waves
* Platform: Range of Unix platforms.
* Description: ESPS is a comprehensive set of speech analysis/processing tools for the UNIX environment. The package includes UNIX commands, and a comprehensive C library (which can be accessed from other languages). Waves is a graphical front-end for speech processing. Speech waveforms, spectrograms, pitch traces etc can be displayed, edited and processed in X windows and Openwindows (versions 2 & 3).
Waves also includes a signal labelling utility which provides multiple feature labelling and useful features for fast labelling of large speech databases. Other Entropic products are HTK (see Q6.5) and TrueTalk (see Q5.5). * Misc: A more detailed description is provided on the Entropic WWW pages (http://www.entropic.com/esps.html). * Cost: On request. * Contact: Entropic Research Laboratory, Washington Research Laboratory 600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003 (202) 547-1420 email: info@entropic.com WWW: http://www.entropic.com/ GoldWave * Platform: Windows * Description: GoldWave is a digital audio editor for Microsoft Windows. It features realtime amplitude/spectrum oscilloscopes, large file editing, effects, and support for a wide variety of sound formats. + Editing of multiple waveforms and large waveforms + Realtime amplitude/spectrum oscilloscopes + Resizable device controls window for accessing audio devices + Realtime fast forward and rewind playback + Effects: distortion, Doppler, echo, filter, mechanize, offset, pan, volume shaping, invert, resample, transpose, etc + Multiple file formats and conversions: .WAV, .AU, .IFF, .VOC, .SND, .MAT, .AIFF, and raw data + CD-ROM controls window More information is available on the GoldWave home page. * Cost: Shareware * Availability: Through the GoldWave home page: http://web.cs.mun.ca/~chris3/goldwave/goldwave.html * Contact: Chris Craig: chris3@cs.mun.ca Kay Elemetrics CSL (Computer Speech Lab) 4300 * Platform: Minimum IBM PC-AT compatible with extended memory (min 2MB) with at least VGA graphics. Optimal would be 386 or 486 machine with more RAM for handling larger amounts of data. * Description: Speech analysis package, with optional separate LPC program for analysis/synthesis. Uses its own file format for data, but has some ability to export data as ascii. The main editing/analysis prog (but not the LPC part) has its own macro language, making it easy to perform repetitive tasks. 
Probably not much use without the extra LPC program, which also allows manipulation of pitch, formant and bandwidth parameters. Hardware includes an internal DSP board for the PC (requires ISA slot), and an external module containing signal processing chips which does A/D and D/A conversion.
* Misc: A programmer's kit is available for programming the signal processing chips (experts only). A speaker and microphone are supplied. Manuals are included.
* Cost: Recently approx 6000 pounds sterling.
* Contact: UK distributors are Wessex Electronics, 114-116 North Street, Downend, Bristol, B16 5SE Tel: 0272 571404. In the USA contact: Kay Elemetrics Corp, 12 Maple Avenue, PO Box 2025, Pine Brook, NJ 07058-9798 Tel: (201) 227-7760

Khoros
* Description: Public domain image processing package with a basic DSP library. Not particularly applicable to speech, but not bad for the price.
* Cost: Free
* Availability: By anonymous ftp from ftp://pprg.eece.unm.edu

Matlab plus Signal Processing Toolbox
* Platform: Wide range
* Description: Matlab (MATrix LABoratory) is a technical computing environment for numerical computation and visualization based on a matrix oriented, interpreted programming language. The programming environment provides support for the development of customized operations, along with debugging facilities and a graphical user interface toolkit. Audio output is provided. A specialised Signal Processing Toolbox is available which provides many functions which are useful for speech analysis. It includes filter design, spectral estimation, statistical signal processing, waveform generation, and signal and spectrogram display. A specialised Auditory Toolbox is available which contains functions useful to people interested in auditory/cochlear models. A more detailed description is given in Q1.10.
* Price: On request.
* Contact: The Math Works Inc.
24 Prime Park Way, Natick, MA 01760-1500 USA Ph: 1-508-653 1415 Fax: 1-508-653 6284 Email: info@mathworks.com ftp://ftp.mathworks.com WWW: http://www.mathworks.com/

MacSpeech Lab II (MSL II)
* Platform: Macintosh
* Description: A sound analysis and acquisition package for Macs. MSL II delivers the most common functions for speech analysis (FFTs, LPCs, f0 extraction, etc.) and produces grayscale spectrographic displays. Can be used for various speech technology and phonetic training tasks.
* Hardware: Requires MacADIOS ("Macintosh Analog/Digital Input/Output System") hardware for speech I/O at 12/16 bits.
* Misc: Software no longer updated by GW Instruments; MSL soft/hardware will not perform input/output on Quadras, for example, though analysis seems fine. Known to operate properly on systems as high as the IIcx and IIfx.
* Availability: MSL has been replaced by SoundScope; see the SoundScope entry for more detail.
* Contact: GW Instruments, 35 Medford Street, Somerville, MA 02143, USA Phone: (617) 625-4096 Fax: (617) 625-1322

N!Power
* Platform: SUN, DEC and HP workstations.
* Description: An object-oriented software package with a MOTIF GUI and a range of functionality for data analysis/editing, signal analysis, speech processing, real-time A/D and D/A, and 2D/3D interactive graphics. N!Power replaces ILS. N!Power can provide a Block Diagram user interface, menus, pop-ups, and a high-level IEEE standard symbolic scripting language. You can customize the blocks, menus and pop-ups with mouse point-and-click operations.
* Contact: Signal Technology, Inc. 104 W.
Anapamu, Suite J, Santa Barbara, CA 93101-3126 Phone: 805-899-8300 FAX: 805-899-4344 email: larry@signal.com

OGI Speech Tools
* Developers from the Center for Spoken Language Understanding (CSLU) at the Oregon Graduate Institute of Science and Technology (Portland, Oregon)
* Platform: Unix
* Description: The OGI Speech Tools include:
  + An X windows display tool (LYRE) for displaying data in a time synchronous fashion for a. the speech signal, b. spectrograms, c. phoneme labels, and other information.
  + A Neural Network (NOPT) training package.
  + A set of C library routines (LIBNSPEECH) for the manipulation of speech data, including: a. PLP Analysis, b. Rasta PLP Analysis, c. Linear Predictive Coding, d. Mel Cepstrum Coding, e. Fast Fourier Transform
  + A set of utilities for converting file formats such as ADC, NIST, mu-law, binary files, and ascii. Includes filtering.
  + A database utility (find_phone) to automate speech database related enquiries. It allows the user to specify a particular label or set of labels in a given context, display all occurrences of the label, and relabel the occurrences if desired.
  + A Vector-Quantizer based on the Linde, Buzo and Gray (LBG) algorithm.
  + A set of PERL scripts which have been used mainly to automate the use of the OGI Speech Tools.
  + MAN pages for all routines and programs developed, as well as a User manual in both PostScript and TeX formats.
* Misc: Software is written in ANSI C.
* Availability: By anonymous ftp from ftp://speech.cse.ogi.edu/pub/tools/
* Contact: Try tools@cse.ogi.edu

Ptolemy
* Platform: Sun SPARC, DecStation (MIPS), HP (hppa).
* Description: Ptolemy provides a highly flexible foundation for the specification, simulation, and rapid prototyping of systems. It is an object oriented framework within which diverse models of computation can co-exist and interact. Ptolemy can be used to model entire systems.
Ptolemy has been used for a broad range of applications including signal processing, telecommunications, parallel processing, wireless communications, network design, radio astronomy, real time systems, and hardware/software co-design. Ptolemy has also been used as a lab for signal processing and communications courses. Ptolemy has been developed at UC Berkeley over the past 3 years. Further information, including papers and the complete release notes, is available from the FTP site.
* Cost: Free
* Availability: The source code, binaries, and documentation are available by anonymous ftp from ftp://ptolemy.berkeley.edu/pub/README

Signalyze 3.0 from InfoSignal
* Platform: Macintosh
* Description: Signalyze's basic conception revolves around up to 100 signals, displayed synchronously in HyperCard fashion on "cards". The program offers a complement of signal editing features, quite a few spectral analysis tools, manual scoring tools, pitch extraction routines, a good set of signal manipulation tools, and extensive input-output capacity. Handles multiple file formats: Signalyze, MacSpeech Lab, AudioMedia, SoundDesigner II, SoundEdit/MacRecorder, SoundWave, three sound resource formats, and ASCII-text. Sound I/O: Direct sound input from MacRecorder and similar devices, AudioMedia, AudioMedia II and AD IN, some MacADIOS boards and devices, Apple sound input (built-in microphone). Sound output via Macintosh internal sound, via SoundManager 3.0, some MacADIOS boards and devices as well as via the Digidesign 16-bit boards. It has a range of capabilities for creating, editing and manipulating label files with flexibility in labelling format.
* Compatibility: MacPlus and higher (including II, IIx, IIcx, IIci, IIfx, IIvx, IIvi, Portable, all PowerBooks, Centris and Quadras). Takes advantage of large and multiple screens and 16/256 color/grayscales. System 7.0 compatible. Runs in background with adjustable priority.
* Misc: A demo available upon request.
Manuals and tutorial included. It is available in English, French, and German. An UPDATER to version 2.48 is now available in:
  + The UNIL Gopher server (see last page of InfoSignal News 8) gopher.agoralang.com
  + The LAIP FTP server. Address: MACFL4082.unil.ch [130.223.104.31]
  Also available are a demo program, and current questions and answers.
* Cost: Individual licence US$350, site license US$500, plus shipping. Upgrades from version 2.0 are available.
* Contact: North America - Network Technology Corporation, 91 Baldwin St., Charlestown MA 02129 Fax: 617-241-5064 Phone: 617-241-9205. Elsewhere, contact InfoSignal Inc., C.P. 73, 1015 LAUSANNE, Switzerland, FAX: +41 21 691-1372, Email: 76357.1213@COMPUSERVE.COM.

SoundScope
* Platform: Macintosh: 68K and PowerPC native
* Description: The SoundScope product family is used primarily in speech teaching and research, with some applications in animal sounds, forensics, and general acoustic analysis. It can record, view, analyze, play, copy, paste, store and print sound waveforms. Analysis functions include spectrogram, fundamental frequency (Fo), Linear Predictive Coding (LPC) including formant tracking, LPC residual, jitter (pitch perturbation), shimmer (amplitude perturbation), HNR, frequency spectrum, spectral slice, envelope, energy and zero crossing. Includes limited built-in filtering, and runs any filter created with WLFDAP. An integrated text editor stores notes and calculation results. SoundScope lets you design your own custom "instrument" screen, tasks (macros) and menus. Supplied instruments include 1 channel analyser (dual snap, dual time, spectrogram, spectrum), 2 channel analyser, segment analyser, multi-channel recorder, etc.
* Note: Supersedes MacSpeech Lab II.
* Price: $490 to $4990, less educational discount
* Availability: In North America, directly from GW Instruments. Contact the company for international distributors.
* Contact: GW Instruments, 35 Medford Street, Somerville, MA 02143, USA Phone: (617) 625-4096 Fax: (617) 625-1322 Email: D0268@Applelink.Apple.COM

Q1.10: SPEECH RESEARCH SITES
Rather than try to list the places round the world which perform speech research, this FAQ lists sites on the WWW where other comprehensive lists are maintained. Try the following:

Shikano's WWW site on Speech and Acoustics
http://www.aist-nara.ac.jp/IS/Shikano-lab/database/internet-resource/e-www-site.html
Lists of speech research sites by country. Currently includes around 100 sites. The list of Japanese sites is particularly comprehensive.

Mambo Speech Research List
http://mambo.ucsc.edu/psl/speech.html
Lists about 50 speech research sites and related information sources. Very nice presentation!

ESCA: European Speech Communication Association
http://ophale.icp.grenet.fr/esca/labos.html
Links to around 15 European speech research sites and around 15 related sources of information.

Russ Wilcox's list of Commercial Speech Recognition
http://www.tiac.net/users/rwilcox/speech.html
Links to information on speech technology vendors, speech research labs, speech resources, on-line demos and more.

Most speech research sites have links to other speech research sites somewhere in their WWW pages. You can keep following those links (till you go round in circles).

Q1.11: MISCELLANEOUS SOFTWARE AND RESOURCES.
SPEECH INTERFACE STANDARDS: APIS ETC (ANY ADDITIONS?)
* Microsoft Speech API
NETWORK "PHONE" SOFTWARE
* CyberPhone
* FAQ: How can I use the Internet as a telephone?
* NetPhone from Electric Magic Company * NEVOT (1.4v) from AT&T BL * Internet Phone from VocalTec AUDIO PROCESSING SOFTWARE * AF version AF3R1 * MixViews * Network Audio System Release 1.1 * NIST Software - SPHERE and SCORE * Sound Processing Kit HUMAN AUDIO PERCEPTION * Auditory Modeller 1 * Auditory Modeller 2 * Auditory Toolbox for Matlab * Human Audio Perception Document DICTIONARIES AND OTHER LEXICAL TOOLS * BEEP dictionary * CMU dictionary * CUVOLAD dictionary * Dictionary * Homophone List * MRC database * Dictionaries on the WWW PHONETIC FONTS * Summer Institute of Linguistics IPA Fonts * Yamada Language Center AF version AF3R1 * Platforms: DEC workstations (Alpha and MIPS), SparcStation, SGI * Description: The AF System is a device-independent network-transparent system including client applications and audio servers. With AF, multiple audio applications can run simultaneously, sharing access to the actual audio hardware. The AF3R1 distribution of AF includes server support for Digital RISC systems running Ultrix, Digital Alpha AXP systems running OSF/1, SGI Indigo running IRIX 4.0.5, Sun Microsystems SPARCstations running SunOS 4.1.3, and Sun Microsystems SPARCstations running Solaris 2.3. The servers support audio hardware ranging from the built-in CODEC audio on SPARCstations and Personal DECstations to 48 KHz stereo audio using the DECaudio TURBOchannel module or the SPARCstation DBRI interface * Availability: The source kit is distributed by anonymous ftp from ftp://crl.dec.com/pub/DEC/AF WWW: http://www.research.digital.com/CRL/projects/AF/home.html * Contact: af-request@crl.dec.com MixViews * Description: A Unix/X sound editor. Does waveform play/record, and cut/splice. Has various filters, handles native file formats, FFT, LPC and more * Availability: by anonymous ftp including SunOS 4 and IRIX 5 binaries. 
ftp://foxtrot.ccmrc.ucsb.edu/pub/MixViews Network Audio System Release 1.1 * Platforms: Various (includes SunOS, Solaris, SGI) * Description: A device-independent mechanism for transferring, playing and recording audio signals over a network. Has a range of features suited to networks. * Cost: Free * Availability: By anonymous ftp from ftp://ftp.x.org:/contrib/audio/nas/netaudio-1.2.tar.gz Also available in the same directory are document files and some sample sounds. NIST SPeech HEader REsources Package (SPHERE) * Description: Standard speech header software from the National Institute of Standards & Technology (NIST). SPHERE headers represent information about sample frequency, sample format, etc. * Availability: By anonymous ftp from Readme File ftp://jaguar.ncsl.nist.gov/pub/sphere.README Source Code ftp://jaguar.ncsl.nist.gov/pub/sphere_2.5.tar.Z NIST Speech Recognition Scoring Package (SCORE) * Description: Software for scoring results of speech recognition systems from the National Institute of Standards & Technology (NIST) . * Availability: By anonymous ftp from README File ftp://jaguar.ncsl.nist.gov/pub/score.README Source Code ftp://jaguar.ncsl.nist.gov/pub/score_3.6.2.tar.Z Sound Processing Kit * Platforms: UNIX * Description: Sound Processing Kit (SPKit) is an object-oriented class library for audio signal processing. SPKit includes classes for various signal processing tasks and a way of implementing sound processing algorithms in a simple object-oriented manner. Sound Processing Kit is implemented in C++ and is designed to be portable. The current version requires a bare-bones C++ 2.0 compatible compiler (templates and exceptions are not needed). ANSI C standard libraries are required. 
SPKit includes classes for
  + Sound input and output
  + Basic signal processing
  + Dynamics processing (compressor, gating etc)
  + Filtering
  + Delay and reverberation
  + Distortion
  + Signal routing
* Availability: Full documentation on the WWW: http://www.music.helsinki.fi/research/spkit/documentation/SPKit.html Software distribution: http://www.music.helsinki.fi/research/spkit/distribution/spkit.tar.Z
* Contact: Kai Lassfolk, University of Helsinki Music Research Laboratory Email: spkit@elisir.helsinki.fi

Auditory Modeller 1
* Description: John Holdsworth's implementation of a gammatone filter bank and Roy Patterson's spiral model, in C (with X-window display).
* Availability: By anonymous ftp from ftp://ftp.mrc-apu.cam.ac.uk/pub/aim

Auditory Modeller 2
* Description: Lowel O'Mard's implementation of peripheral filtering, Ray Meddis's hair cell model and other stuff in C (as a library of routines).
* Availability: By anonymous ftp from ftp://suna.lut.ac.uk/public/hulpo/lutear

Auditory Toolbox for Matlab
* Description: This toolbox provides extensions to Matlab which are useful to people interested in auditory/cochlear modeling. [Matlab is described in the previous section.] This toolbox has been tested on both Macintosh and Unix computers. It includes the following major models:
  + Lyon's Passive Long Wave Cochlear Model (our conventional model)
  + Patterson-Holdsworth ERB Filter bank with Meddis Hair cell
  + Seneff's Auditory Model (Stages I and II)
  + MFCC (Mel-scale frequency cepstral coefficients from the ASR world)
  + Spectrogram
  + Correlogram generation and pitch modeling
  + Simple vowel synthesis
* Availability: By anonymous FTP from the following site: ftp://ftp.apple.com/pub/malcolm The following files are available:
  + AuditoryToolbox.mif.Z
  + AuditoryToolbox.psc.Z
  + AuditoryToolbox.sea.hqx
  + AuditoryToolbox.tar
  + AuditoryToolbox.tar.Z
  The ".mif.Z" file is a Unix compressed version of the FrameMaker documentation.
  The ".psc.Z" file is a Unix compressed version of the Postscript documentation. The ".tar" and ".tar.Z" files are Unix TAR archives containing all of the m-functions and C-MEX source code. Finally, the ".sea.hqx" file is a Macintosh self-extracting archive that has been encoded using BinHex. There is a precompiled version of the three MEX functions for the Macintosh.
* Misc: Our lawyers ask us to remind you that there is no warranty. We've done some testing but we undoubtedly missed things.
* Contact: Malcolm Slaney: Interval Research. Email: malcolm@interval.com

Human Audio Perception Document
* Description: Document prepared by Argiris Kranidiotis on the human audio perception system. It lists a number of references, gives plenty of numbers and some equations.
* Availability: By anonymous ftp from the comp.speech archive site ftp://svr-ftp.eng.cam.ac.uk/comp.speech/info/HumanAudioPerception
* Contact: Argiris A. Kranidiotis, University Of Athens, Informatics Department email: akra@zeus.di.uoa.ariadne-t.gr

BEEP dictionary
* Description: Phonemic transcriptions of 150,000 English words. (British English pronunciations)
* Availability: By anonymous ftp from the following files:
  BEEP dictionary README file: svr-ftp.eng.cam.ac.uk/comp.speech/dictionaries/beep-0.6.README
  BEEP Dictionary (1.1M): svr-ftp.eng.cam.ac.uk/comp.speech/dictionaries/beep-0.6.tar.gz

CMU dictionary
* Description: Phonemic transcriptions of 100,000 words with American English pronunciation.
* Availability: By anonymous ftp from the directory ftp://ftp.cs.cmu.edu/project/fgdata/dict with the files README, cmudict.0.2.Z, cmulex.0.1.Z, phoneset.0.1

CUVOLAD dictionary
* Description: Computer Usable Version of the Oxford Advanced Learner's Dictionary. Has British English pronunciations and parts of speech.
* Availability: By anonymous ftp from the directory ftp://black.ox.ac.uk/ota/dicts/710

Dictionary
* Description: A comprehensive word list which should contain most common American words, abbreviations, hyphenations, and even incorrect spellings. The word lists were compiled from a number of sources: commercial news services, UseNet news postings, existing dictionaries, name lists, company lists, UNIX man pages, project Gutenberg's E-texts, project Wordnet, received mailings, etc. The current size is 460,000 words.
* Availability: By anonymous ftp from ftp://wocket.vantage.gte.com/pub/standard_dictionary
  Note 1: There seems to be some sort of network problem reaching the server.
  Note 2: There is a README file which explains the file formats.

Homophone List
* A list of homophones in General American English is available by anonymous FTP from the comp.speech archive site: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/dictionaries/homophones-1.01.txt

MRC database
* Description: The Medical Research Council Psycholinguistic Database. Has British English pronunciations, parts of speech, word frequency and lots of other information.
* Availability: By anonymous ftp from the directory ftp://black.ox.ac.uk/ota/dicts/1054

Dictionaries on the WWW
For a while, there was a range of dictionaries and other lexical resources on the WWW and elsewhere on the Internet. However, for copyright reasons, fewer sites are publishing dictionary information. When last checked, the following sites provided dictionaries or links to dictionaries on the net:
* A comprehensive list of dictionaries, acronym lists, translation resources, and a Thesaurus. http://galaxy.einet.net/galaxy/Reference-and-Interdisciplinary-Information/Dictionaries-etc.html
* Webster's dictionary online http://c.gp.cs.cmu.edu:5103/prog/webster
___________________________________________________________________________
Copyright (c) 1995 by Andrew Hunt, all rights reserved.
This FAQ may be posted to any USENET newsgroup, on-line service, or BBS as long as it is posted in its entirety and includes this copyright statement. This FAQ may not be distributed for financial gain. This FAQ may not be included in any collections or compilations without express permission from the author.
---
Andrew Hunt
ATR Interpreting Telecommunications Research Labs
Hikari-dai 2-2, Seika-cho, Kyoto, 619-02, Japan
Tel: +81-774-95 1390 Fax: +81-774-95 1308 Email: andrew@itl.atr.co.jp
----------------------------------------------------------------------

COMP.SPEECH FAQ POSTING - PART 2/3

FAQ SECTION 2 - SIGNAL PROCESSING FOR SPEECH
* Q2.1: What sampling do I need for speech?
* Q2.2: Finding the pitch of a speech signal
* Q2.3: How do I find the start and end points of a speech signal?
* Q2.4: Where can I find FFT software?
* Q2.5: Signal processing in speech technology
* Q2.6: Speech sampling and signal processing hardware
* Q2.7: How do I convert to/from mu-law format?
___________________________________________________________________________

Q2.1: WHAT SAMPLING DO I NEED FOR SPEECH?
For recorded speech to be understood by humans you need a sampling rate of 8kHz or more and at least 8 bits per sample. This produces poor quality speech - but it can be understood. Improvements can be achieved by increasing the number of bits per sample to 12 or 16, or by using a non-linear encoding technique such as mu-law or A-law (see Q2.7). This improves the signal-to-noise ratio. Increasing the sampling rate above 8kHz, say to 10kHz, 16kHz or 20kHz, improves the frequency response: a sampled signal can only represent frequencies up to half the sampling rate, so the higher the sampling frequency, the more high-frequency content is preserved. A 16kHz sampling rate is a reasonable target for high quality speech recording and playback. When doing speech recognition you need to remember that your computer is not as good as your ear, so it will have trouble with poor quality sounds. The choice of an appropriate sampling setup depends very much on the speech recognition task and the amount of computer power available.

Q2.2: FINDING THE PITCH OF A SPEECH SIGNAL
This topic comes up regularly in the comp.dsp newsgroup. Question 2.5 of the FAQ posting for comp.dsp gives a comprehensive list of references on the definition, perception and processing of pitch. The comp.dsp FAQ is posted regularly to the comp.dsp newsgroup, and is also available by ftp and on the WWW:
* http://www.bdti.com/dsp_faq.htm
* ftp://rtfm.mit.edu/pub/usenet/comp.dsp/

Q2.3: HOW DO I FIND THE START AND END POINTS OF A SPEECH SIGNAL?
A large number of papers have been presented on this task. Try the following papers:
* Rabiner LR, Sambur MR, "An Algorithm for Determining the Endpoints of Isolated Utterances", Bell System Technical Journal, Vol 54, No. 2, pp 297-315, 1975.
* Drago, P.G. et al. "Digital Dynamic Speech Detectors." IEEE Trans on Communications, Vol 26, No 1, Jan 78, pp. 140-145.
* Newman, W.C. "Detecting Speech with an Adaptive Neural Network." Electronic Design. 22 March 1990.
* Taboada, J. et al. "Explicit Estimation of Speech Boundaries" IEE Proc. Sci. Meas. Technol., Vol 141, No.3, May 1994, pp 153-159.

Q2.4: WHERE CAN I FIND FFT SOFTWARE?
The most comprehensive list of FFT code I know of is available on the WWW. It contains links to about 65 different pieces of one-dimensional FFT code. http://tjev.tel.etf.hr/josip/DSP/fft.html
You might also try the following file available by anonymous ftp. It contains a series of optimised FFT routines, including mixed-radix algorithms. ftp://usc.edu/pub/C-numanal/fft-stuff.tar.gz

Q2.5: SIGNAL PROCESSING IN SPEECH TECHNOLOGY
This question is far too big to be answered in a FAQ posting. Here are some WWW resources and books which cover the area well.
Tony Robinson has put his Speech Analysis course notes on the web. The root page is http://svr-www.eng.cam.ac.uk/~ajr/SA95. There is information on the following:
* Sampling theory
* Filter bank analysis
* Short-term Fourier analysis
* Linear prediction analysis
* Formant analysis and voicing analysis
* Speech coding
* and more....
The Signal Processing Home page has information on a range of DSP issues. It includes references to a range of software and much more. (Note: the page is in Croatia and is quite slow.) http://tjev.tel.etf.hr/josip/DSP/sigproc.html
There are many good books which discuss signal processing for speech:
* Digital processing of speech signals; L. R. Rabiner, R. W. Schafer. Englewood Cliffs; London: Prentice-Hall, 1978
* Voice and Speech Processing; T. W. Parsons. New York; McGraw Hill 1986
* Computer Speech Processing; ed Frank Fallside, William A. Woods. Englewood Cliffs: Prentice-Hall, c1985
* Digital speech processing : speech coding, synthesis, and recognition edited by A.
Nejat Ince; Kluwer Academic Publishers, Boston, c1992
* Speech science and technology; edited by Shuzo Saito pub. Ohmsha, Tokyo, c1992
* Speech analysis; edited by Ronald W. Schafer, John D. Markel, New York, IEEE Press, c1979
* Speech Communication: Human and Machine Douglas O'Shaughnessy, Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987.
* Discrete-time processing of speech signals; John R Deller, John G Proakis, John H L Hansen; Macmillan 1993.
* Signal processing of speech; F J Owens; Macmillan 1993.

Q2.6: SPEECH SAMPLING AND SIGNAL PROCESSING HARDWARE

In addition to the following information, have a look at the Audio File format document prepared by Guido van Rossum (see details in Section 1.8).

Information is included on hardware for the following systems:
* Macintosh Audio Hardware
* PC Audio Hardware
* Unix Audio Hardware

Can anyone provide information for SGI, NeXT, other UNIX hardware and any other PC soundcards?

Macintosh Audio Hardware - an overview
* Description: ALL Macintosh computers come with the ability to play back sounds at any sample rate (sample rate conversion is done in software.) Older machines have 8 bit stereo output (hardware runs at 22254 samples/second). The newer machines have 16 bit stereo hardware running at 44100 samples/second.
  Most of the recent Macintosh computers come with sound input hardware. There are probably exceptions to this, but the older and some of the current low-end machines have 8 bit (linear) mono hardware running at 22254.54 samples/second. All of the PowerPC, AV, and the 500 series notebook computers come with 16 bit 44kHz stereo sampling hardware. They can also record at 22050 samples/second.
  The sound manager implements an AGC (Automatic Gain Control) function for the 8 bit hardware. The drivers have a switch to turn off the AGC.
  There are a number of DSP vendors that support high quality audio.
  Generally this means quieter analog sections, and more IO formats (AES/EBU, for example). Try DigiDesign and Spectral Innovations.
  The software drivers for sound are described in "Inside Macintosh: Sound". If you want to see some sample code check out the sources for the Matlab "Sound and Image Toolbox". They can be found at
    ftp://ftp.apple.com/pub/malcolm/SoundAndImageToolbox.cpt.hqx
  Routines that play and record sounds using the toolbox are included (and interfaced to Matlab).

PC Audio Hardware

Note: new soundcards are becoming available all the time - the information below is definitely not up to date. Check out the following newsgroups for up-to-date information.
* comp.sys.ibm.pc.soundcard
* comp.sys.ibm.pc.soundcard.GUS
* comp.sys.ibm.pc.soundcard.advocacy
* comp.sys.ibm.pc.soundcard.games
* comp.sys.ibm.pc.soundcard.misc
* comp.sys.ibm.pc.soundcard.music
* comp.sys.ibm.pc.soundcard.tech

An excellent source of programs and information for soundcards is available on SimTel:
    http://www.acs.oakland.edu/oak/SimTel/win3/sound.html

Additional information on PC soundcards is available by anonymous ftp from:
    ftp://rtfm.mit.edu/pub/usenet/comp.sys.ibm.pc.soundcard.misc/Aria_Soundcard_FAQ_v1.05
    ftp://rtfm.mit.edu/pub/usenet/comp.sys.ibm.pc.soundcard.misc/Aria_Soundcard_Support_List_v2.09
    ftp://rtfm.mit.edu/pub/usenet/comp.sys.ibm.pc.soundcard.misc/Midi_files_software_archives_on_the_Internet
    ftp://rtfm.mit.edu/pub/usenet/comp.sys.ibm.pc.soundcard.misc/Turtle_Beach_sound_cards_FAQ

IBM RS/6000 ACPA (Audio Capture and Playback Adapter)
* Description: The card supports PCM, Mu-Law, A-Law and ADPCM at 44.1kHz (& 22.05, 11.025, 8kHz) with 16-bits of resolution in stereo. The card has a built-in DSP (don't know which one). The device also supports various formats for the output data, like big-endian, twos complement, etc. Good noise immunity. The card is used for IBM's VoiceServer (they use the DSP for speech recognition).
  Apparently, the IBM VoiceServer has a speaker-independent vocabulary of over 20,000 words and each ACPA can support two independent sessions at once.
* Cost: $US495
* Contact: ?

Sound Galaxy NX, Aztech Systems
* Platform: PC - DOS, Windows 3.1
* Cost: ?
* Input: 8bit linear, 4-22 kHz.
* Output: 8bit linear, 4-44.1 kHz
* Misc: 11-voice FM Music Synthesizer YM3812; Built-in power amplifier; DSP signal processing support - ST70019SB, Hardware ADPCM decompression (2:1,3:1,4:1) "AdLib" and "Sound Blaster" compatibility.

Dicon DSProto
* Description: DSP/PC card (ISA bus) with TI TMS320C31 (40 or 50MHz), 32Kx32 zero wait state SRAM, external bus and serial port. Provided with C3X assembler/linker and PC/DSP Utility Program (C/DSP code library) which include routines for FFTs, IIRs, FIRs and Mu-law (CELP and JPEG also - but licensed).
* Cost: $US419.95 industry, $US399.95 education
* See also: DSProto Codec below
* Contact: Dicon Lab
  1810 NW 23rd Blvd., Suite 164
  Gainesville, FL 32605
  phone: 904-372-6160
  fax: 904-376-7215
  email: diconlab@aol.com

Dicon DSProto Codec
* Platform: PC
* Description: External board which attaches to the DSProto serial port. 16 bit, dual-channel, 7.35-44.1kHz sampling A/D and D/A. Includes drivers for DSProto and a demo program with echo, bass and LPF effects.
* Cost: $US159.95 industry, $US149.95 education
* See also: DSProto above
* Contact: Dicon Lab
  1810 NW 23rd Blvd., Suite 164
  Gainesville, FL 32605
  phone: 904-372-6160
  fax: 904-376-7215
  email: diconlab@aol.com

Sound Galaxy NX PRO, Aztech Systems
* Platform: PC - DOS, Windows 3.1
* Cost: ?
* Input: 2 * 8bit linear, 4-22.05 kHz (stereo), 4-44.1 kHz (mono).
* Output: 2 * 8bit linear, 4-44.1 kHz (stereo/mono)
* Misc: 20-voice FM Music Synthesizer; Built-in power amplifier; Stereo Digital/Analog Mixer; Configuration in EEPROM. Hardware ADPCM decompression (2:1,3:1,4:1). Includes DSP signal processing support. "AdLib" and "Sound Blaster Pro II" compatibility.
  Software includes a simple Text-to-Speech program and Sampling laboratory for Windows 3.1: WinDAT.
* Contact: USA (510)6238988

ATI Stereo F/X Sound Board
* Platform: PC XT or AT - DOS, Windows 3.0, 3.1
* Cost: $120 Canadian
* Description: Input - 8 bit ADC, 44.1 kHz mono, 22.05 kHz Stereo. Output - Dynamic range = 48 dB, 32 anti-aliasing filters. Adds Stereo effect to existing mono Adlib or Sound Blaster apps. 11-voice YAMAHA FM Music Synthesizer. Built-in 8 watt power amplifier, 4 watts per channel. Volume ctrl on rear. 2 Joystick input, software setup (no switches), software included. "AdLib" and "Sound Blaster" compatibility. DMA support for high speed digital audio. ADPCM decomp @ 4:1, 3:1, 2:1. Will play .WAV files. Optional MIDI I/O port $79. (MIDI IN, OUT, THRU, and sequencer).
* Contact: ATI Technologies Inc.
  3761 Victoria Park Avenue, Scarborough, Ontario CANADA, M1W 3S2
  Ph: (416) 756-0711
  Fax: (416) 756-0720
  BBS: (416) 764-9404 (9600 baud N.8.1)

Ariel Signal Processors
* Description: A range of signal I/O, A/D, D/A and DSP products are available. There are too many to list.
* Contact: Ariel Corp.
  433 River Road, Highland Park, NJ 08904.
  Ph: 908-249-2900
  Fax: 908-249-2123
  DSP BBS: 908-249-2124

Other PC Sound Cards

====================================================================================
sound           stereo/mono               compatible      included        voices
card            & sample rate             with            ports
====================================================================================
Adlib Gold      stereo: 8-bit 44.1khz     Adlib ?         audio           20 (opl3)
1000                    16-bit 44.1khz                    in/out,         +2 digital
                mono:   8-bit 44.1khz                     mic in,         channels
                        16-bit 44.1khz                    joystick,
                                                          MIDI

Sound Blaster   mono:   8-bit 22.1khz     Adlib           audio           11 synth.
                                          FM synth with   in/out,
                                          2 operators     joystick,

Sound Blaster   stereo: 8-bit 22.05khz    Adlib           audio           22
Pro Basic       mono:   8-bit 44.1khz     Sound Blaster   in/out,
                                                          joystick,

Sound Blaster   stereo: 8-bit 22.05khz    Adlib           audio           11
Pro             mono:   8-bit 44.1khz     Sound Blaster   in/out
                                                          joystick,
                                                          MIDI, SCSI

Sound Blaster   stereo: 8-bit 4-44.1khz   Sound Blaster   audio           20
16 ASP          stereo: 16-bit 4-44.1khz                  in/out,
                                                          joystick,
                                                          MIDI

Audio Port      mono:   8-bit 22.05khz    Adlib           audio           11
                                          Sound Blaster   in/out,
                                                          joystick

Pro Audio       stereo: 8-bit 44.1khz     Adlib           audio,          20
Spectrum +                                Pro Audio       in/out,
                                          Spectrum        joystick

Pro Audio       stereo: 16-bit 44.1khz    Adlib           audio           20
Spectrum 16                               Pro Audio       in/out,
                                          Spectrum        joystick,
                                          Sound Blaster   MIDI, SCSI

Thunder Board   stereo: 8-bit 22khz       Adlib           audio           11
                                          Sound Blaster   in/out,
                                                          joystick

Gravis          stereo: 8-bit 44.1khz     Adlib,          audio line      32 sampled
Ultrasound      mono:   8-bit 44.1khz     Sound Blaster   in/out,         32 synth.
                                                          amplified out,
                (w/16-bit daughtercard)                   mic in, CD
                stereo: 16-bit 44.1khz                    audio in,
                mono:   16-bit 44.1khz                    daughterboard
                                                          ports (for
                                                          SCSI and
                                                          16-bit)

MultiSound      stereo: 16-bit 44.1kHz    Nothing         audio           32 sampled
                64x oversampling                          in/out,
                                                          joystick,
                                                          MIDI
====================================================================================

Unix Audio Hardware

Could someone please provide information on the audio capabilities of DECstations, SGI and other Unix platforms?

Sun standard audio port: SPARC I & II
* Input and Output: 1 channel, 8 bit mu-law encoded, 8kHz sample rate. This provides telephone quality sampling.

Sun DBRI audio port (SPARC 10 & 20)
* Input and Output: Stereo (2 channels). 16-bit linear sampling. Multiple sample rates (48000, 44100, 37800, 32000, 22050, 18900, 16000, 11025, 9600, 8000 Hz)

Ariel Signal Processors
* Platform: Various
* Description: A range of signal I/O, A/D, D/A and DSP products are available. There are too many to list.
* Contact: Ariel Corp.
  433 River Road, Highland Park, NJ 08904.
  Ph: 908-249-2900
  Fax: 908-249-2123
  DSP BBS: 908-249-2124

Q2.7: HOW DO I CONVERT TO/FROM MU-LAW FORMAT?

Mu-law coding is a form of compression for audio signals including speech. It is widely used in the telecommunications field because it improves the signal-to-noise ratio without increasing the amount of data. Typically, mu-law compressed speech is carried in 8-bit samples. It is a companding technique. That means that it carries more information about the smaller signals than about larger signals.

On SUN Sparc systems have a look in the directory /usr/demo/SOUND. Included are table lookup macros for ulaw conversions. [Note however that not all systems will have /usr/demo/SOUND installed as it is optional - see your system admin if it is missing.]

OR, here is some sample conversion code in C.

/**
 ** Signal conversion routines for use with Sun4/60 audio chip
 **/

#include <stdio.h>

unsigned char linear2ulaw(int sample);
int ulaw2linear(unsigned char ulawbyte);

/*
** This routine converts from linear to ulaw
**
** Craig Reese: IDA/Supercomputing Research Center
** Joe Campbell: Department of Defense
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711  (very difficult to follow)
** 2) "A New Digital Technique for Implementation of Any
**     Continuous PCM Companding Law," Villeret, Michel,
**     et al. 1973 IEEE Int. Conf. on Communications, Vol 1,
**     1973, pg. 11.12-11.17
** 3) MIL-STD-188-113,"Interoperability and Performance Standards
**     for Analog-to-Digital Conversion Techniques,"
**     17 February 1987
**
** Input: Signed 16 bit linear sample
** Output: 8 bit ulaw sample
*/

#define ZEROTRAP    /* turn on the trap as per the MIL-STD */
#define BIAS 0x84   /* define the add-in bias for 16 bit samples */
#define CLIP 32635

unsigned char linear2ulaw(int sample)
{
  /* Maps bits 7..14 of the biased magnitude to the segment (exponent). */
  static const int exp_lut[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,
                                   4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
                                   5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
                                   5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
                                   6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                                   6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                                   6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                                   6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                                   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                                   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                                   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                                   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                                   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                                   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                                   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                                   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
  int sign, exponent, mantissa;
  unsigned char ulawbyte;

  /* Get the sample into sign-magnitude. */
  sign = (sample >> 8) & 0x80;          /* set aside the sign */
  if (sign != 0) sample = -sample;      /* get magnitude */
  if (sample > CLIP) sample = CLIP;     /* clip the magnitude */

  /* Convert from 16 bit linear to ulaw. */
  sample = sample + BIAS;
  exponent = exp_lut[(sample >> 7) & 0xFF];
  mantissa = (sample >> (exponent + 3)) & 0x0F;
  ulawbyte = (unsigned char) ~(sign | (exponent << 4) | mantissa);
#ifdef ZEROTRAP
  if (ulawbyte == 0) ulawbyte = 0x02;   /* optional CCITT trap */
#endif

  return(ulawbyte);
}

/*
** This routine converts from ulaw to 16 bit linear.
**
** Craig Reese: IDA/Supercomputing Research Center
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711  (very difficult to follow)
** 2) MIL-STD-188-113,"Interoperability and Performance Standards
**     for Analog-to-Digital Conversion Techniques,"
**     17 February 1987
**
** Input: 8 bit ulaw sample
** Output: signed 16 bit linear sample
*/

int ulaw2linear(unsigned char ulawbyte)
{
  /* Biased base value of each of the 8 segments. */
  static const int exp_lut[8] = {0,132,396,924,1980,4092,8316,16764};
  int sign, exponent, mantissa, sample;

  ulawbyte = (unsigned char) ~ulawbyte;
  sign = (ulawbyte & 0x80);
  exponent = (ulawbyte >> 4) & 0x07;
  mantissa = ulawbyte & 0x0F;
  sample = exp_lut[exponent] + (mantissa << (exponent + 3));
  if (sign != 0) sample = -sample;

  return(sample);
}

___________________________________________________________________________

FAQ SECTION 3 - SPEECH CODING AND COMPRESSION

* Q3.1: Speech compression techniques
* Q3.2: References on coding/compression
* Q3.3: Compression and Coding Software

Q3.1: SPEECH COMPRESSION TECHNIQUES

Note: the comp.compression FAQ includes a few questions and answers on the compression of speech.

The aim of speech compression is to produce a compact representation of speech sounds such that when reconstructed it is perceived to be close to the original. The two main measures of closeness are intelligibility and naturalness. The standard reference point is toll quality speech: the quality that would be expected over a telephone line, for example speech sampled at 8 kHz using 8 bit ulaw coding, with a maximum frequency of about 3.3 kHz. This is a bit rate of 64 kbps, and as such represents a compressed form over (say) 16 bit, 16 kHz speech which is the standard in speech recognition work.

ulaw coding does not exploit the (normally large) sample to sample correlations found in speech. ADPCM is the next family of speech coding techniques, and does exploit this redundancy by using a simple linear filter to predict the next sample of speech.
The resulting prediction error is typically quantised to 4 bits thus giving a bit rate of 32 kbps (see, for example, the software in Q3.3: 32 kbps ADPCM, G.711/721/723 Compression, shorten). The advantages of ADPCM are that it is simple to implement and has very low delay.

To obtain more compression, specific properties of the speech signal must be modelled. The main assumption is known as the source filter model of speech production. This assumes that a source (voicing or fricative excitation) is passed through a filter (the vocal tract response) to produce the speech. The simplest implementation of this is known as an LPC synthesiser (e.g. LPC10e). At every frame the speech is analysed to compute the filter coefficients, the energy of the excitation, a voicing decision, and a pitch value if voiced. At the decoder a regular set of pulses for voiced speech or white noise for unvoiced speech is passed through the linear filter and multiplied by the gain to produce the speech. This is a very efficient system and typically produces speech coded at 1200-2400bps. With clever acoustic vector prediction this can be reduced to 300-600bps. The disadvantages are a loss of naturalness over most of the speech and occasionally a loss of intelligibility.

The CELP family of coders compensates for the lack of quality of the simple LPC model by using more information in the excitation. Each vector in a codebook of excitation vectors is tried and the index of the one that best matches the original speech is transmitted. This results in an increase in the bit rate to typically 4800-9600bps. Most speech coding research is currently directed towards CELP coders. (See, for example, CELP 3.2a, a TMS implementation, a G.728 LD-CELP vocoder, and the L&H implementation.)

Q3.2: REFERENCES ON CODING/COMPRESSION

Tony Robinson's lecture notes on Speech Analysis have some coverage of speech coding (http://svr-www.eng.cam.ac.uk/~ajr/SA95/node78.html).
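As a quick check on the bit rates quoted in Q3.1 and in the GSM 06.10 entry of Q3.3, here is a minimal sketch in C. The function names (pcm_bit_rate, frame_bit_rate, print_bit_rates) are made up for this illustration and are not part of any package listed in this FAQ; the sketch only multiplies out the figures already given in the text.

```c
#include <stdio.h>

/* Bit rate (bits per second) of sample-by-sample coded speech.
 * Hypothetical helper, for illustration only. */
long pcm_bit_rate(long sample_rate_hz, int bits_per_sample)
{
    return sample_rate_hz * (long)bits_per_sample;
}

/* Bit rate (bits per second) of a coder that emits fixed-size frames.
 * Hypothetical helper, for illustration only. */
long frame_bit_rate(int bits_per_frame, int frames_per_second)
{
    return (long)bits_per_frame * (long)frames_per_second;
}

/* Print the figures quoted in Q3.1 and in the GSM 06.10 entry of Q3.3. */
void print_bit_rates(void)
{
    printf("16 bit, 16 kHz linear: %ld bps\n", pcm_bit_rate(16000L, 16));
    printf("8 bit, 8 kHz mu-law:   %ld bps\n", pcm_bit_rate(8000L, 8));
    printf("4 bit, 8 kHz ADPCM:    %ld bps\n", pcm_bit_rate(8000L, 4));
    printf("GSM 06.10, 260 bits at 50 frames/s: %ld bps\n",
           frame_bit_rate(260, 50));
}
```

Calling print_bit_rates() from a main() of your own should report 256000, 64000, 32000 and 13000 bps respectively, so 8 bit mu-law at 8 kHz is a 4:1 reduction over 16 bit, 16 kHz speech, and 4 bit ADPCM halves the mu-law rate again.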
The following books cover speech coding/compression.
* Douglas O'Shaughnessy, Speech Communication: Human and Machine, Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987.
* Bishnu Atal in ed. Fallside, F. and W. Woods, ed. Computer Speech Processing. London: Prentice/Hall International, 1985.
* Makhoul, J. "Linear Prediction: A Tutorial Review." Proc. of the IEEE 63 (1975): 561 - 580.

Q3.3: COMPRESSION AND CODING SOFTWARE

The following speech compression software is described in the FAQ.
* 32 kbps ADPCM
* CELP 3.2a & LPC
* 8 Kbit/s CELP on the TMS320C5x family of DSP chips
* File format conversion
* G.711/721/723 Compression
* G.728 LD-CELP vocoder
* G.728 Compression
* GSM 06.10 Compression
* Lernout & Hauspie Speech Coding (5 products)
* Lernout & Hauspie Speech Coding SDK
* shorten - a lossless compressor for speech signals
* TrueSpeech from DSP Group
* U.S.F.S. 1016 CELP vocoder for DSP56001
* ToolVox from Voxware

32 kbps ADPCM
* Platform: SGI and Sun Sparcs
* Description: 32 kbps ADPCM C-source code (G.721 compatibility is uncertain)
* Contact: Jack Jansen
* Availability: Anonymous ftp
    ftp://ftp.cwi.nl/pub/adpcm.shar

CELP 3.2a & LPC
* Platform: Sun (the makefiles & source can be modified for other platforms)
* Description: CELP is a lossy compression technique. The U.S. DoD's Federal-Standard-1016 based 4800 bps code excited linear prediction voice coder version 3.2a (CELP 3.2a) Fortran and C simulation source codes. Available for worldwide distribution (on DOS diskettes, but configured to compile on Sun SPARC stations) from NTIS and DTIC. Example input and processed speech files are included. A Technical Information Bulletin (TIB), "Details to Assist in Implementation of Federal Standard 1016 CELP," and the official standard, "Federal Standard 1016, Telecommunications: Analog to Digital Conversion of Radio Voice by 4,800 bit/second Code Excited Linear Prediction (CELP)," are also available.
* Availability 1: National Technical Information Service (NTIS)
  U.S. Department of Commerce
  5285 Port Royal Road, Springfield, VA 22161, USA
  The "AD" ordering number for the CELP software is AD M000 118 (US$ 90.00) and for the TIB it's AD A256 629 (US$ 17.50). The LPC-10 standard, described below, is FIPS Pub 137 (US$ 12.50). There is a $3.00 shipping charge on all U.S. orders. The telephone number for their automated system is 703-487-4650, or 703-487-4600 if you'd prefer to talk with a real person.
  (U.S. DoD personnel and contractors can receive the package from the Defense Technical Information Center: DTIC, Building 5, Cameron Station, Alexandria, VA 22304-6145. Their telephone number is 703-274-7633.)
* Availability 2: By anonymous ftp from ftp.super.org (192.31.192.1):
    ftp://ftp.super.org/pub/celp_3.2a.tar.Z
  Or from the comp.speech ftp server:
    ftp://svr-ftp.eng.cam.ac.uk/comp.speech/coding/celp_3.2a.tar.Z
    ftp://svr-ftp.eng.cam.ac.uk/comp.speech/coding/celp_3.2a.tar.gz
* Misc: The following articles describe the Federal-Standard-1016 4.8-kbps CELP coder (it's unnecessary to read more than one):
  + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch, "The Federal Standard 1016 4800 bps CELP Voice Coder," Digital Signal Processing, Academic Press, 1991, Vol. 1, No. 3, p. 145-155.
  + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch, "The DoD 4.8 kbps Standard (Proposed Federal Standard 1016)," in Advances in Speech Coding, ed. Atal, Cuperman and Gersho, Kluwer Academic Publishers, 1991, Chapter 12, p. 121-133.
  + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch, "The Proposed Federal Standard 1016 4800 bps Voice Coder: CELP," Speech Technology Magazine, April/May 1990, p. 58-64.
  The U.S. DoD's Federal-Standard-1015/NATO-STANAG-4198 based 2400 bps linear prediction coder (LPC-10) was republished as a Federal Information Processing Standards Publication 137 (FIPS Pub 137). It is described in:
  + Thomas E.
Tremain, "The Government Standard Linear Predictive Coding Algorithm: LPC-10," Speech Technology Magazine, April 1982, p. 40-49. There is also a section about FS-1015 in the book: + Panos E. Papamichalis, Practical Approaches to Speech Coding, Prentice-Hall, 1987. The voicing classifier used in the enhanced LPC-10 (LPC-10e) is described in: + Campbell, Joseph P., Jr. and T. E. Tremain, "Voiced/Unvoiced Classification of Speech with Applications to the U.S. Government LPC-10E Algorithm," Proceedings of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1986, p. 473-6. Copies of the official standard, "Federal Standard 1016, Telecommunications: Analog to Digital Conversion of Radio Voice by 4,800 bit/second Code Excited Linear Prediction (CELP)" are available for US$ 5.00 each from: + GSA Federal Supply Service Bureau Specification Section, Suite 8100 470 E. L'Enfant Place, S.W. Washington, DC 20407 (202)755-0325 Realtime DSP code for FS-1015 and FS-1016 is sold by: + John DellaMorte, DSP Software Engineering 165 Middlesex Tpk, Suite 206, Bedford, MA 01730, USA Ph: 1-617-275-3733 Fax: 1-617-275-4323 Email: dspse.bedford@channel1.com DSP Software Engineering's FS-1016 code can run on a DSP Research's Tiger 30 (a PC board with a TMS320C3x and analog interface suited to development work). + DSP Research 1095 E. Duane Ave, Sunnyvale, CA 94086, USA Ph: (408)773-1042 Fax: (408)736-3451 8 Kbit/s CELP on the TMS320C5x family of DSP chips * Description: For low bandwidth transmission of voice, compact voice storage for archival purposes, low-cost digital answering machines and efficient storage for voice mail. Features : + near toll quality at 8 Kb/s. + Variable rate option with 1 Kb/s silence encoding. + Implemented on a fixed-point processor for lower system cost. + Attractive licensing scheme. + Future availability of 4 Kb/s. + Custom rates possible. 
  Capacity:
  + Two half-duplex or one full duplex channels on the 20 MIPS 'C5x (at 95% and 55% CPU utilization respectively).
  + Two full duplex channels on the 28.6 MIPS 'C5x (at 77% CPU utilization).
  + Requires 9 K-words program memory and 3 K-words data memory.
  + Decoding in real-time on a 486 class CPU.
* Contact: CVI Inc.
  443 Vienna Cres. North Vancouver, BC, Canada V7N 3B3
  Tel: (604) 987 1719
  Fax: (604) 986 8139
  Email: cvi@extropia.wimsey.com

File format conversion
* Platform: SUN OS?
* Description: Conversion utility able to encode and decode between the following formats: G.723, G.721, A-law, u-law and linear.
* Availability: By anonymous ftp from
    ftp://ftp.cwi.nl/pub/audio/ccitt-adpcm.tar.Z

G.711/721/723 Compression
* Description:
  + G.711 : CCITT u-law and A-law compression
  + G.721 : CCITT 32 kbps ADPCM coder
  + G.723 : CCITT 24 kbps and 40 kbps ADPCM coders
* Availability: By email to itudoc@itu.ch, with GET ITU-3022 as the *only* line in the body of the message. It is also available by anonymous ftp from:
    ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/coding/G711_G721_G723.tar.Z

G.728 LD-CELP vocoder
* Platform: Analog Devices ADSP-2171
* Description: Real-time, full-duplex G.728 LD-CELP vocoder that runs on a single Analog Devices ADSP-2171. Source and object code available for a one-time license fee.
* Contact: Cole Erskine
  Analogical Systems
  299 California Avenue, Suite 120
  Palo Alto, CA 94306, USA
  Tel: (415) 323-3232
  FAX: (415) 323-4222
  email: cole@analogical.com

G.728 Compression
* Description: G.728 low delay CELP package written by Alex Zatsman of Analog Devices, Inc.
* Availability: By anonymous ftp from
    ftp://dspsun.eas.asu.edu/pub/speech/ldcelp.tgz

GSM 06.10 Compression
* Platform: Unix; faster than real time on most Sun SPARCstations
* Description: GSM 06.10 is a standardized lossy speech compression employed by most European wireless telephones.
  It uses RPE/LTP (regular pulse excitation/long term prediction) coding to compress frames of 160 13-bit samples (8 kHz sampling rate, i.e. a frame rate of 50 Hz) into 260 bits.
* Contact: GSM 06.10 support and implementation
  jutta@cs.tu-berlin.de, cabo@cs.tu-berlin.de
* Availability: The following configurations are available by anonymous ftp:
  gzip compression from Germany:
    ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/gsm-1.0.7.tar.gz
  MS-DOS compression from Germany:
    ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/ddj/gsm-107.zip
  MS-DOS compression from USA:
    ftp://ftp.mv.com/pub/ddj/1194.12/gsm-105.zip
* Misc: The WWW site is http://www.cs.tu-berlin.de/~jutta/toast.html

Lernout & Hauspie Speech and Music Coding Product Range
* Product name: L&H.smc650: 32kbps ADPCM Speech coding
  + Implementation of ADPCM 32 kbps based on CCITT G721 standard.
  + Estimated quality: 4.1 MOS (Mean Opinion Score)
  + Hardware Example: Analog Devices ADSP2101
  + Input / Output signal: A-Law or mu-Law PCM (64 kbps); Linear signal with up to 16 bits per sample; 8 kHz sampling rate
* Product name: L&H.smc550: LD-CELP 16 kbps speech coding
  + Proprietary implementation of LD-CELP 16 kbps based on CCITT G728 standard.
  + Estimated quality: 4.0 MOS (Mean Opinion Score)
  + Hardware Example: Motorola 5600X
  + Input / Output signal: A-Law or mu-Law PCM (64 kbps); Linear signal with up to 16 bits per sample; 8 kHz sampling rate
* Product name: L&H.smc450: 16-17.5 kbps speech coding
  + Estimated Quality: 3.9 MOS (Mean Opinion Score)
  + Hardware Examples: Analog Devices ADSP2101, Intel 486 DX2/66 MHz
  + Input / Output Signal: A-Law or mu-Law PCM (64 kbps); Linear signal with up to 16 bits per sample; 8 kHz sampling rate.
* Product name: L&H.smc350: 4.8-9.6 kbps speech coding
  + Proprietary CELP based software for compression rates of 4.8 kbps to 9.6 kbps
  + Estimated Quality: 3.5 MOS (Mean Opinion Score)
  + Hardware Examples: AT&T DSP32C
  + Input / Output signal: A-Law or mu-Law PCM (64 kbps); Linear signal with up to 16 bits per sample; 8 kHz or 11.025kHz sampling rate.
* Product name: L&H.smc250: 2.4 kbps speech coding
  + Combination of multi band excitation and code book excited linear prediction.
  + Estimated Quality: 3.0 MOS (Mean Opinion Score).
  + Hardware Examples: Intel 486 DX2/66 MHz, Analog Devices ADSP2101
  + Input signal: A-Law or mu-Law PCM (64 kbps); Linear signal with 12-15 bits per sample; 8 kHz sampling rate.
  + Output signal: A-Law or mu-Law PCM (64 kbps); Linear signal with 12-15 bits per sample; 8 kHz sampling rate.
* See also: L&H Speech Coding SDK
* Cost: Unknown
* Contact: Lernout & Hauspie Speech Products
  800 West Cummings Park, Suite 3100
  Woburn, MA 01801, USA
  Tel: (617) 932 4118
  Fax: (617) 932 9209
  Email: sales@lhs.com

Lernout & Hauspie Speech Coding SDK
* Description: Windows based software development kit for integrating speech coding technology with Windows based PC applications.
* Requirements: IBM-compatible 486 DX/33 MHz + 2MB RAM + MS DOS 5.0 + MS Windows 3.1 (or higher) + Sound Blaster compatible sound board.
* See also: L&H Speech Coding Products
* Cost: Unknown
* Contact: Lernout & Hauspie Speech Products
  800 West Cummings Park, Suite 3100
  Woburn, MA 01801, USA
  Tel: (617) 932 4118
  Fax: (617) 932 9209
  Email: sales@lhs.com

shorten - a lossless compressor for speech signals
* Platform: UNIX/DOS
* Description: A fast waveform coder suitable for speech and music signals in a wide variety of file formats. The degree of compression is adjustable from lossless to three bits a sample. 16bit 16kHz speech generally attains 50% lossless compression and 16:3 compression of CDROM quality speech is obtainable with only minor audible degradation.
* Availability: Anonymous ftp - UNIX and DOS versions
    ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/coding/shorten.tar.gz
    ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/coding/shorten.tar.Z
    ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/coding/shorten.zip

TrueSpeech from DSP Group
* Description: TrueSpeech is a family of speech compression and decompression algorithms and software. It is designed for personal computers and personal communications devices. With the high compression ratios ranging from 15:1 to 27:1, TrueSpeech improves the storage and communications transmission of digital voice information and can be used in the integration of personal computers and telephones.
  TrueSpeech can be utilized in many products and applications such as:
  + Multimedia PCs
  + Sound cards and modems
  + Computer/telephony and teleconferencing
  + Voice mail systems and PBX systems
  + Wireless/cellular applications
  + Personal digital assistants
  + Games, Education
  + Video/cable and on-line services
  The TrueSpeech encoder is available for free in the Sound System of Windows 95 and Windows NT. The DSPG WWW pages have information on how to add TrueSpeech capability to your WWW pages.
* Contact: DSP Group, Inc.
  3120 Scott Boulevard, Santa Clara, CA 95054-3317, USA
  Phone: (408) 986-4300
  Fax: (408) 986-4323
  Email: Webster@dspg.com
  WWW: http://www.dspg.com/index.html

U.S.F.S. 1016 CELP vocoder for DSP56001
* Platform: DSP56001
* Description: Real-time U.S.F.S. 1016 CELP vocoder that runs on a single 27MHz Motorola DSP56001. Free demo software available for PC-56 and PC-56D. Source and object code available for a one-time license fee.
* Contact: Cole Erskine
  Analogical Systems
  299 California Avenue, Suite 120
  Palo Alto, CA 94306, USA
  Tel: (415) 323-3232
  FAX: (415) 323-4222
  Email: cole@analogical.com

ToolVox from Voxware
* Platform: Windows and soon available on Mac (in Beta now) and Unix
* Description: ToolVox is a proprietary frequency domain speech coder.
  11 kHz speech is coded to an average rate of between 5,000 bits per second and 9,000 bps. Real-time compression algorithms available for 2,400 bps. 22 kHz playback, as well as an ultra low bit rate 8 kHz codec, are coming soon. On playback, the time scale can be changed by a 5x factor, pitch can be modified over a 3 octave range, and vocal personality can be modified using a transformation function called VoiceFonts(tm).
* Misc 1: A SDK for Windows is available.
* Misc 2: Demo software is available from the Voxware Inc WWW page: http://www.voxware.com/
* Price: Basic toolkit is $895 US. OEM and mass distribution licenses are separate. Ordering information is provided on the Voxware WWW server.
* Contact: Voxware, Inc.
  Ph: (609) 497-1212
  Fax: (609) 497-2490
  Sale information: sales@voxware.com
  WWW: http://www.voxware.com/

___________________________________________________________________________

FAQ SECTION 4 - NATURAL LANGUAGE PROCESSING

There is now a newsgroup specifically for Natural Language Processing: comp.ai.nat-lang. A FAQ posting is available for the group:
    ftp://rtfm.mit.edu/pub/usenet/comp.ai.nat-lang/Natural_Language_Processing_FAQ

There is also a lot of useful information on Natural Language Processing in the comp.ai FAQ. That FAQ lists available software and useful references. It includes a substantial list of software, documentation and other info available by ftp.

The FAQ has information on the following:
* Q4.1: NLP References and Books
* Q4.2: NLP Software

Q4.1: NLP REFERENCES AND BOOKS

Take a look at the FAQ for the "comp.ai" newsgroup as it also includes some useful references.
* James Allen: Natural Language Understanding, (Benjamin/Cummings Series in Computer Science) Menlo Park: Benjamin/Cummings Publishing Company, 1987.
  + This book consists of four parts: syntactic processing, semantic interpretation, context and world knowledge, and response generation.
* G. Gazdar and C.
Mellish, Natural Language Processing in Prolog, Addison Wesley, 1989 * G. Gazdar and C. Mellish, Natural Language Processing in Lisp, Addison Wesley, 1989 * G. Gazdar and C. Mellish, Natural Language Processing in Pop11, Addison Wesley, 1989 + Emphasis on parsing, especially unification-based parsing, lots of details on the lexicon, feature propagation, etc. Fair coverage of semantic interpretation, inference in natural language processing, and pragmatics; much less extensive than in Allen's book, but more formal. There are three versions, one for each programming language listed above, with complete code. * Shapiro, Stuart C.: Encyclopedia of Artificial Intelligence Vol.1 and 2. New York: John Wiley & Sons, 1990. + There are articles on the different areas of natural language processing which also give additional references. * Paris, Ce'cile L.; Swartout, William R.; Mann, William C.: Natural Language Generation in Artificial Intelligence and Computational Linguistics. Boston: Kluwer Academic Publishers, 1991. + The book describes the most current research developments in natural language generation and all aspects of the generation process are discussed. The book is comprised of three sections: one on text planning, one on lexical choice, and one on grammar. * Readings in Natural Language Processing, ed by B. Grosz, K. Sparck Jones and B. Webber, Morgan Kaufmann, 1986 + A collection of classic papers on Natural Language Processing. Fairly complete at the time the book came out (1986) but now seriously out of date. Still useful for ATN's, etc. * Klaus K. Obermeier, Natural Language Processing Technologies in Artificial Intelligence: The Science and Industry Perspective, Ellis Horwood Ltd, John Wiley & Sons, Chichester, England, 1989. The following are extensive bibliographies related to NLP: * Computational Parsing : Syntactic Analysis, Semantic Analysis, Semantic Interpretation, Parsing Algorithms, Parsing Strategies : BIBLIOGRAPHY, by Conrad F. 
Sabourin 1994, 2 volumes, 1029p, ISBN 2-921173-02-6, INFOLINGUA inc., P.O. Box 187 Snowdon, Montreal, H3X 3T4, Canada.
* Computational Text Understanding : Natural Language Programming, Argument Analysis : BIBLIOGRAPHY, by Conrad F. Sabourin 1994, 657p, ISBN 2-921173-06-9, INFOLINGUA inc., P.O. Box 187 Snowdon, Montreal, H3X 3T4, Canada.
* Computational Text Generation : Generation from data or Linguistic Structure, Text Planning, Sentence Generation, Explanation Generation : BIBLIOGRAPHY, by Conrad F. Sabourin with a survey article by Mark T. Maybury 1994, 649p, ISBN 2-921173-07-7, INFOLINGUA inc., P.O. Box 187 Snowdon, Montreal, H3X 3T4, Canada.
* Natural Language Processing : Interfaces to Databases, to Expert Systems, to Robots, to Operating Systems, and to Question-Answering Systems : BIBLIOGRAPHY, by Conrad F. Sabourin, 1994, 2 volumes, 847p, ISBN 2-921173-08-5, INFOLINGUA inc., P.O. Box 187 Snowdon, Montreal, H3X 3T4, Canada.

Journals

The major journals of the field are
* Computational Linguistics and Cognitive Science for the artificial intelligence aspects,
* Cognition for the psychological aspects,
* Language, Linguistics and Philosophy, and Linguistic Inquiry for the linguistic aspects.
* Artificial Intelligence occasionally has papers on natural language processing.

Conferences

The major conferences of the field are
* ACL (held every year)
* COLING (held every two years).

Most AI conferences have an NLP track; AAAI, ECAI, IJCAI and the Cognitive Science Society conferences are usually the most interesting for NLP. CUNY is an important psycholinguistic conference. There are lots of linguistic conferences: the most important seem to be NELS, the conference of the Chicago Linguistic Society (CLS), WCCFL, LSA, the Amsterdam Colloquium, and SALT.

Q4.2: NLP SOFTWARE

Natural Language Software Registry (NLSR) - NLP Tools
* The Natural Language Software Registry is available from the German Research Institute for Artificial Intelligence (DFKI) in Saarbrucken.
Its purpose is to facilitate the exchange and evaluation of natural language processing software within the research community. To this end, the NLSR is cataloging natural language software projects, both commercial and non-commercial. The new updated and enlarged version contains more than 100 descriptions of natural language processing software. Registry listings include:
  + speech signal processors, such as the Computerized Speech Lab (Kay Elemetrics)
  + morphological analyzers, such as PC-KIMMO (Summer Institute for Linguistics)
  + parsers, such as Alveytools (University of Edinburgh)
  + semantic and pragmatic analyzers, such as NLL (University of the Saarland, Germany)
  + generation programs, such as FUF (Ben Gurion University of the Negev)
  + knowledge representation systems, such as Rhet (University of Rochester)
  + multicomponent systems, such as ELU (ISSCO), PENMAN (ISI), Pundit (UNISYS), SNePS (SUNY Buffalo)
  + NLP-Tools, such as GULP (University of Georgia) or Linguist (Kansai Research Laboratory)
  + applications programs (misc.)
* If you have developed a piece of software for natural language processing that other researchers might find useful, you can include it by returning the questionnaire available from the sources below.
* ftp://ftp.dfki.uni-sb.de/pub/registry
* e-mail: registry@dfki.uni-sb.de
* post: Natural Language Software Registry
  Deutsches Forschungsinstitut fuer Kuenstliche Intelligenz (DFKI)
  Stuhlsatzenhausweg 3
  D-66123 Saarbruecken
  Germany
* Other ftp sites are
  ftp://crlftp.nmsu.edu/pub/non-lexical/NL_Software_Registy
  ftp://dri.cornell.edu/pub/Natural_Language_Software_Registry

Part of Speech Tagger
* Description: A rule-based part of speech tagger developed by Eric Brill. For a detailed description of the tagger see chapter 6 of his thesis.
* Availability: The tagger and description are available by anonymous ftp from ftp://lightning.lcs.mit.edu/pub/BRILL/Programs & Papers
___________________________________________________________________________

Copyright (c) 1995 by Andrew Hunt, all rights reserved.

This FAQ may be posted to any USENET newsgroup, on-line service, or BBS as long as it is posted in its entirety and includes this copyright statement. This FAQ may not be distributed for financial gain. This FAQ may not be included in any collections or compilations without express permission from the author.

---
Andrew Hunt
ATR Interpreting Telecommunications Research Labs
Hikari-dai 2-2, Seika-cho, Kyoto, 619-02, Japan
Tel: +81-774-95 1390
Fax: +81-774-95 1308
Email: andrew@itl.atr.co.jp
----------------------------------------------------------------------
Path: news1.ucsd.edu!ihnp4.ucsd.edu!swrinde!newsfeed.internetmci.com!news.kei.com!bloom-beacon.mit.edu!senator-bedfellow.mit.edu!faqserv
From: andrew@itl.atr.co.jp (Andrew Hunt)
Newsgroups: comp.speech,comp.answers,news.answers
Subject: comp.speech Frequently Asked Questions - part 3/3
Supersedes:
Followup-To: comp.speech
Date: 22 Dec 1995 14:10:49 GMT
Organization: ATR International, Japan
Lines: 2799
Approved: news-answers-request@MIT.Edu
Expires: 2 Feb 1996 14:10:32 GMT
Message-ID:
References:
Reply-To: andrew@itl.atr.co.jp (Andrew Hunt)
NNTP-Posting-Host: bloom-picayune.mit.edu
Summary: Information on Speech Technology
X-Last-Updated: 1995/12/19
Originator: faqserv@bloom-picayune.MIT.EDU
Xref: news1.ucsd.edu comp.speech:6604 comp.answers:13226 news.answers:51628

Archive-name: comp-speech-faq/part3
Last-modified: 1995/12/19
URL: http://www.speech.su.oz.au/comp.speech/

COMP.SPEECH FAQ POSTING - PART 3/3

[Note: this document has been automatically extracted from a WWW site: http://www.speech.su.oz.au/comp.speech This may introduce some formatting errors.]

FAQ SECTION 5 - SPEECH SYNTHESIS

* Q5.1: What is speech synthesis?
* Q5.2: How can speech synthesis be performed?
* Q5.3: References/Books on Synthesis
* Q5.4: Speech Synthesis on the WWW
* Q5.5: Speech Synthesis Software/Hardware

Q5.1: WHAT IS SPEECH SYNTHESIS?

Speech synthesis is the task of transforming written input to spoken output. The input can either be provided in a graphemic/orthographic or a phonemic script, depending on its source.

Could someone provide a more informative description?

Q5.2: PERFORMING SPEECH SYNTHESIS

There are several algorithms. The choice depends on the task they are used for.

The easiest way is to just record the voice of a person speaking the desired phrases. This is useful if only a restricted set of phrases and sentences is used, e.g. messages in a train station, or schedule information via phone. The quality depends on the way the recording is done.

More sophisticated, but worse in quality, are algorithms which split the speech into smaller pieces. The smaller those units are, the fewer of them are needed, but the quality also decreases. An often used unit is the phoneme, the smallest linguistic unit. Depending on the language, there are about 35-50 phonemes in western European languages, i.e. 35-50 single recordings are needed. The problem is combining them, as fluent speech requires fluent transitions between the elements. The intelligibility is therefore lower, but the memory required is small.

A solution to this dilemma is using diphones. Instead of splitting at the transitions, the cut is made at the center of the phonemes, leaving the transitions themselves intact. An inventory of N phonemes gives up to N*N diphones, e.g. about 400 elements (20*20) for 20 phonemes, and the quality increases.

The longer the units become, the more elements there are, but the quality increases along with the memory required. Other units which are widely used are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings.

Q5.3: REFERENCES/BOOKS ON SYNTHESIS

The following are good introductory books/articles.
* Douglas O'Shaughnessy, Speech Communication: Human and Machine, Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987.
* D. H. Klatt, "Review of Text-To-Speech Conversion for English", Jnl. of the Acoustical Society of America (JASA), v82, Sept. 1987, pp 737-793.
* "Talking Machines, Theories, Models and Designs", Eds. G. Bailly & C. Benoit (Elsevier: North Holland)
* I. H. Witten, Principles of Computer Speech (London: Academic Press, Inc., 1982).
* John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to Speech: The MITalk System", Cambridge University Press, 1987.

The following book is a comprehensive bibliography of speech processing.
* Computational Speech Processing: Speech Analysis, Recognition, Understanding, Compression, Transmission, Coding, Synthesis ; Text to Speech Systems, Speech to Tactile Displays, Speaker Identification, Prosody Processing : BIBLIOGRAPHY, by Conrad F. Sabourin, 1994, 2 volumes, 1187p, ISBN 2-921173-21-2, INFOLINGUA inc., P.O. Box 187 Snowdon, Montreal, H3X 3T4, Canada.

Q5.4: SPEECH SYNTHESIS ON THE WWW

There is a growing amount of information on speech synthesis available on the World Wide Web. Apart from the information in Q5.5, check out the following:

Speech Synthesis "Museum"
URL: http://www.cs.bham.ac.uk/~jpi/synth/museum.html
Maintained by Jon Iles at the University of Birmingham. Information and speech samples for
  + YorkTalk
  + Loughborough Sound Images
  + University of Birmingham - FDFS
  + Eurovocs
  + DECtalk
  + AT&T Bell Labs Synthesiser
  + S.W.A.Ll.C. - Welsh Synthesis from CSTR
  + All-Prosodic Speech Synthesis - IPOX
  + Orator from Bellcore

Say...
http://wwwtios.cs.utwente.nl/say
WWW demo of the rsynth speech synthesis software. The WWW capability was implemented by Axel Belinfante.

AT&T Bell Laboratories Voices
http://www.research.att.com/cgi-bin/voices.form/
WWW interface to the AT&T Bell Laboratories text to speech (TTS) synthesizer

Yahoo page on speech generation
http://www.yahoo.com/Science/Computer_Science/Artificial_Intelligence/Natural_Language_Processing/Speech_Generation/

Q5.5: SPEECH SYNTHESIS SOFTWARE/HARDWARE

Please email any updates, corrections or additions to the following list. The range of commercially available synthesis software is growing rapidly so any help in keeping up to date will be appreciated.
* AsTeR
* TheBigMouth
* CSRE: Canadian Speech Research Environment
* DECTalk
* Eloquence
* Emacspeak - A Speech Output Subsystem For Emacs
* Infovox Product Range
* JSRU
* Klatt-style synthesiser
* KPE80 - A Klatt Synthesiser and Parameter Editor
* "learph": Trainable text-to-phoneme software by Antonio Lucca
* Lernout and Hauspie Text-To-Speech (3 products)
* Lernout and Hauspie Text-To-Speech Windows SDK
* Various Mac Speech Output Applications
* MacinTalk
* Monologue for Windows from First Byte
* Narrator Translator Library
* Narrator
* TextToSpeech Kit (NeXT)
* Orator from Bellcore
* PAM - A Text-To-Speech Application
* ProVerbe Speech Engine for Windows
* ProVoice Developer's Speech Toolkit from First Byte
* RC Systems V8600/V8601 Text to Speech synthesizers
* rsynth
* SENSYN speech synthesizer
* SGI Developers Toolbox Synthesiser
* SIMTEL
* Sound Bytes Developer's Kit
* spchsyn.exe
* Speak
* Speech Manager and PlainTalk
* Text to Phoneme Program 1
* Text to phoneme program 2
* Text to phoneme program 3
* Tinytalk
* TrueTalk
* TruVoice from Centigram

AsTeR
* Platform: UNIX
* Description: TTS front-end program which encodes structural information about documents in speech synthesis. For more information check out: http://www.research.digital.com/CRL/personal/raman/aster/aster-toplevel.html
* Operation requirements: Lisp: Lucid, clisp
* Contact: T. V.
Raman email: raman@crl.dec.com

TheBigMouth - a Text to Speech Program
* Platform: NeXT
* Description: Text to speech program based on concatenation of pre-recorded speech segments. NeXT equivalent of "Speak" for Suns.
* Availability: try NeXT archive sites such as sonata.cc.purdue.edu.

CSRE: Canadian Speech Research Environment
* Platform: PC
* Cost: Distributed on a cost recovery basis.
* Description: CSRE is a software system which includes, in addition to the Klatt speech synthesizer, SPEECH ANALYSIS and EXPERIMENT CONTROL SYSTEM. A paper about the whole package can be found in:
  + Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc. of the Second Intl. Conf. on Spoken Language Processing, Edmonton: University of Alberta, pp. 1127-1130.
* Hardware: Can use a range of data acquisition/DSP hardware.
* Availability: For more information contact
  AVAAZ Innovations Inc.
  P.O.Box 8040
  1225 Wonderland Rd. N
  London, Ontario, CANADA, N6G 2B0
  Tel : (519) 472-7944
  Fax : (519) 472-7814
  Email: info@avaaz.com
* Note: A more detailed description is given in Section 1.9 on speech environments.

DECTalk
* Description: Speech synthesis hardware and software. Detailed information on DECtalk and other DEC products is available on a World-Wide Web site.
  + http://www.digital.com/info.html
  For specific information on DECtalk, check out this WWW URL:
  + http://www.digital.com/archive/pub/Digital/info/Customer-Update/940620005.txt

Eloquence
* Platform: Windows, Solaris, SunOS, SGI, RS/6000
* Description: Software based text-to-speech package. Generates waveforms completely algorithmically instead of by concatenating waveforms, for maximum flexibility and naturalism. For instance, when the user requests a deeper voice, the software simulates a larger vocal tract, instead of simply pitch-shifting samples. Uses high-level linguistic parsing, which obviates the need for a huge dictionary. Handles numbers, acronyms, currency, etc.
Includes a set of annotation symbols, for placing stress on particular words, expressing excitement/boredom, etc. Also allows phonetic input. Support for Windows DDL. Produces male and female voices for General American English. Dialects under development include Alabama, Brooklyn, and Boston.
* Price: Flexible license agreements on application.
* Availability: Eloquent Technology, Inc.
  2389 North Triphammer Road
  Ithaca, NY 14850
  Ph: (607) 607-266-7025
  Fax: (607) 607-266-7030
  Email: eti@plab.dmll.cornell.edu

Emacspeak - A Speech Output Subsystem For Emacs
* Platform: UNIX, Emacs
* Description: Emacspeak is a speech output system that will allow someone who cannot see to work directly on a UNIX system. Emacspeak is built on top of Emacs. With Emacspeak loaded, Emacs provides spoken feedback for everything you do. Emacspeak currently supports the new Dectalk Express speech synthesizer, as well as older versions of the Dectalk, e.g. the MultiVoice. See the Emacspeak WWW page, the Emacspeak FAQ or the Emacspeak distribution for additional details.
* Requirements: Requires GNU FSF Emacs 19 (version 19.23 or later) and TCLX 7.3B (Extended TCL) to run Emacspeak.
* Availability:
  Emacspeak WWW page: http://www.research.digital.com/CRL/personal/raman/emacspeak/emacspeak.html
  Emacspeak source: http://www.research.digital.com/CRL/personal/raman/emacspeak/emacspeak.tar.gz
* Contact: T. V. Raman
  Email: raman@adobe.com
  Email: raman@cs.cornell.edu

Infovox Product Range
* Description: Multilingual text-to-speech systems. Languages available: American English, British English, German, French, Spanish, Italian, Swedish, Norwegian, Icelandic, Danish and Finnish.
* Product name: INFOVOX 500, PC BOARD
  + Product description: Half length expansion board for IBM PC, XT, AT, PS/2 model 30 or compatible personal computers. The board can also be connected via the serial port.
Language and control program for downloading into RAM or mounted on EPROMs
  + Platform: for IBM PC, XT, AT, PS/2 model 30 or compatible
  + Delivered standard interface: MS DOS I/O driver
* Product name: INFOVOX 600, OEM BOARD
  + Product description: OEM board built with CMOS ICs. Language and control program are stored in on-board fixed memory.
  + Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600 Baud.
  + Delivered standard interfaces: MS DOS I/O driver and interface to Apple Speech manager.
* Product name: INFOVOX 700, DESKTOP UNIT
  + Product description: Desktop unit with built-in Infovox 600 to be connected to any computer or terminal via an RS 232-C serial interface. Built-in loudspeaker and rechargeable battery for 4 hours use, and control knobs for continuous control of speech volume and speed.
  + Platform: any
  + Delivered standard interfaces: MS DOS I/O driver and interface to Apple Speech manager
* Product name: INFOVOX 650, OEM BOARD
  + Product description: OEM board built with CMOS ICs. Language and control program are stored in on-board memory.
  + Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600 Baud
  + Delivered standard interfaces: MS DOS I/O driver and interface to Apple Speech manager
* Product name: INFOVOX 750, DESKTOP UNIT
  + Product description: Desktop unit with built-in Infovox 650 to be connected to any computer or terminal via an RS 232-C serial interface. Built-in loudspeaker and rechargeable battery for 5 hours use, and a control knob for continuous control of speech volume.
  + Platform: any
  + Delivered standard interfaces: MS DOS I/O driver and interface to Apple Speech manager
* Product name: Infovox 210, software for Apple Macintosh
  + Product description: Software based text-to-speech conversion. Produces 16 bit and 8 bit sound. Delivered on 3.5" diskettes with user lexicon and complete documentation.
  + Platform: Apple Macintosh with minimum 68030, 33 MHz microprocessor.
  + Delivered standard interfaces: Standard interface to Apple Speech manager
* Product name: Infovox 220, software for Microsoft Windows.
  + Product description: Software based text-to-speech conversion. Produces 16 bit sound and conforms to Microsoft Windows multimedia standard MCI. Delivered on 3.5" diskettes with user lexicon and complete documentation.
  + Platform: IBM compatible PC with minimum 486, 25 MHz microprocessor.
  + Delivered standard interfaces: Standard interface to Microsoft Windows 3.1 and sound boards supporting Microsoft Windows multimedia driver for audio.
* Contact: Telia Promotor Infovox AB
  TTS Sales Division
  P.O. Box 2069
  S-171 02 Solna, Sweden
  Ph: +46 8 764 35 00
  Fax: +46 8 735 78 76
  email: tts-sales@infovox.se

JSRU
* Platform: UNIX and PC
* Cost: 100 pounds sterling (from academic institutions and industry)
* Description: A C version of the JSRU system, Version 2.3, is available. It is written in Turbo C but runs on most Unix systems with very little modification. A Form of Agreement must be signed to say that the software is required for research and development only.
* Contact: Dr. E. Lewis (eric.lewis@bristol.ac.uk)

Klatt-style synthesiser
* Platform: Unix
* Cost: Free
* Description: Software posted to comp.speech in late 1992.
* Availability: By ftp from the comp.speech ftp site
  + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/synthesis/klatt-3.04.tar.gz
  + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/synthesis/klatt-3.04.tar.Z
* See also: KPE80 - A Klatt Synthesiser and Parameter Editor.

KPE80 - A Klatt Synthesiser and Parameter Editor
* Platform: Unix
* Description: The KPE80 program provides a graphical interface for the implementation of the Klatt 1980 formant synthesiser written by Jon Iles and Nick Ing-Simmons. It was inspired by IGE, a piece of code written by Rob Fletcher (http://www.york.ac.uk/~rpf1/IGE.html).
* Technical Desc.: It comprises an X-Window interface and version 3.03 of the synthesiser code.
The interface allows users to display and edit Klatt parameters using a graphical display which includes the time-amplitude waveform of both the original speech and its synthetic copy, and some signal analysis facilities. Most of the work in choosing the parameter values to produce the synthetic copy has to be done by the user. KPE will estimate the fundamental frequency contour from an original token; this estimate will need to be amended where errors occur. It is possible to specify the formant trajectories with some precision by overlaying the appropriate formant frequency parameter tracks on the spectrogram of the target waveform. A number of facilities exist to help in the refinement of parameter values: original and synthetic waveforms can be compared aurally, spectrally, and spectrographically using built-in speech analysis facilities.
* File formats: KPE will read RIFF (.wav) files and SFS files. (SFS is a suite of speech-signal processing programs available free from Phonetics and Linguistics, UCL.)
* Availability:
  KPE for SunOS 4.1.3 (statically compiled libraries): ftp://pitch.phon.ucl.ac.uk/pub/kpe/kpe80.sun413.tar.Z
  KPE for Linux (statically compiled libraries): ftp://pitch.phon.ucl.ac.uk/pub/kpe/kpe80.linux.tar.Z
  The source code (needs gcc and SUIT to compile): ftp://pitch.phon.ucl.ac.uk/pub/kpe/kpe80.src.tar.Z
  A postscript overview of KPE: ftp://pitch.phon.ucl.ac.uk/pub/kpe/OVERVIEW.ps
  The SFS distribution: ftp://pitch.phon.ucl.ac.uk/pub/sfs
* See also: Public domain Klatt-style speech synthesis code.
* Contact: Andrew Simpson
  Department of Phonetics and Linguistics
  University College London
  Wolfson House, 4 Stephenson Way, London NW1 2HE
  email: a.simpson@ucl.ac.uk
  WWW: http://www.phon.ucl.ac.uk/home/andrew/home.html

"learph": Trainable text-to-phoneme software by Antonio Lucca
* Platform: UNIX (unconfirmed)
* Description: Experimental software which learns text to phoneme translation from examples.
* Availability: Examples and source are available on the WWW: http://www.dsi.unimi.it/Users/Students/lucca/TTS/ttsdoc.html
* Contact: Antonio Lucca: lucca@ghost.dsi.unimi.it

Lernout & Hauspie Text-to-Speech (3 products)

Lernout & Hauspie have three TTS products. The functionality of the products is similar; however, they differ in hardware implementation and in other details as described below.
* L&H tts2000/T: TTS for the Telephony and Telecommunications Market
* L&H tts2000/M: TTS for the Computer and Multimedia Market
* L&H tts3000/C: TTS for the Business and Consumer Electronics Market
* Description: Text to Speech (TTS) software based on parameterized segment concatenation (diphones, triphones and tetraphones) algorithms. Available for US English, German, Dutch, French, Spanish (Castilian), Italian and Korean. General features include:
  + The control of volume, speech rate and speech pitch.
  + The use of control sequences to customize TTS output (adding pauses, using phonetic input, etc.).
  + Switching between languages at run time.
  + A personal vocabulary editor is available for building exception dictionaries.
  + Readout modes: letter by letter, word by word or sentence by sentence.
  + Input formats: orthographic input, phonetic input, phonetic input with prosodic information.
* tts2000/T
  + Output formats: 8 bit mu-law PCM, 8 bit A-law PCM, 16 bit linear PCM.
  + Sampling Frequency: 8 kHz
  + Single channel platform examples: SHARP SH7000, ARM6/ARM7, Intel i960, TI TMS320C31, AT&T DSP3210
  + Multi-channel platform examples: TI TMS320C31, AT&T DSP3210
* tts2000/M
  + Output formats: 8/16 bit wave format, 8 bit mu-law PCM, 8 bit A-law PCM, 16 bit linear PCM.
  + Sampling Frequency: 8/10/11.025 kHz
  + Single processor platform examples: ARM6/ARM7, Intel 386/486/Pentium, Motorola 68040
  + Two processor platform examples: {Intel 386/486/Pentium or Motorola 68030} and {ADI ADSP21XX or Motorola 5600X or TI TMS320C25/20C5X}
* tts3000/C
  + Output formats: 8 bit mu-law PCM, 8 bit A-law PCM, 16 bit linear PCM.
  + Sampling Frequency: 10 kHz
  + Single processor platform examples: SHARP SH7000, ARM6/ARM7, Intel i960, TI TMS320C31, AT&T DSP3210
  + Two processor platform examples: {SHARP SH7000 or ARM6/ARM7 or Intel 386EX or Motorola 683XX} and {ADI ADSP21XX or Motorola 5600X or TI TMS320C25/C5X or TI TSP50C10}
* See also: L&H Windows TTS SDK
* Price: Unknown
* Contact: Lernout & Hauspie Speech Products
  800 West Cummings Park, Suite 3100
  Woburn, MA 01801, USA
  Tel: (617) 932 4118
  Fax: (617) 932 9209
  Email: sales@lhs.com

Lernout & Hauspie Text-to-Speech Windows SDK
* Platform: IBM-Compatible
* Description: The L&H Text-to-Speech software developers kit is able to integrate text-to-speech technology with your own or existing PC applications under Microsoft Windows 3.1. This software will allow conversion of written text into clear human sounding synthetic speech.
* Requirements: IBM-compatible PC 386 DX/33 + 8Mb RAM + MS DOS 5.0 + MS Windows 3.1 (or higher) + SoundBlaster compatible sound board.
* See also: L&H TTS Products
* Price: Unknown
* Contact: Lernout & Hauspie Speech Products
  800 West Cummings Park, Suite 3100
  Woburn, MA 01801, USA
  Tel: (617) 932 4118
  Fax: (617) 932 9209
  Email: sales@lhs.com

Various Mac Speech Output Applications

Application         Source              Description
AddressSpeech       info-mac            4D talking address book (from Speech Pack 2.0)
At Ease 2.0         MacWarehouse        Friendly desktop that speaks file names
At Ease 2.0 WG      MacWarehouse        Friendly desktop that speaks file names
Eliza 3.1           AOL                 Talking Eliza (Rogerian psych therapist)
FB speech           Inside Basic Mag, volume 3, no. 6.  FutureBasic demo
FB Speech demo      Inside Basic Mag, volume 3, no. 7.  FutureBasic demo
Fortune 1.1         info-mac            Like a talking UNIX fortune command - slick
Homer 0.92d9        zaphod.ee.pitt.edu  GUI IRC client, assign nicks voices - slick
MacMessage 1.0      FirstClassBBS       Share talking messages/customizable startup
Say                 info-mac            MPW Tool which converts standard input to speech
ScriptTools 1.2     info-mac            Write AppleScript scripts to say text messages
Siege Watch 1.01f   info-mac            Wryly political speaking clock
SoToSpeak1.0.0b10   info-mac            Two voice conversation (also see Fortune's About)
Speak It!           info-mac            Type in a message and have it spoken
Speaker 1.11        info-mac            Simple text file editor, speaks on CR, macros
Speecher 1.2.1      info-mac            Customizable word pronunciation/substitution
SpeechManagerdemo   info-mac            Command line interface, C source, aka -explorer
Speech Pack 2.0     info-mac            4th Dimension external, add speech to database
speek-02b           info-mac            Speech XCMD for HyperCard
TalkingClockPro2.0  info-mac            AppleScriptable talking clock extension (2.0b0)
TeachText 7.2       AV Mac              Apple's talking TeachText (simple editor w/QT)
Tex-Edit 1.9        AOL                 Talking word processor, McSink like, modeming
VoiceDemo 1.0.1     info-mac            Bare bones phrase talker
Welcome!v1.3.1      info-mac            A talking Welcome to Macintosh startup
?                   ?                   Talking Plug-In-Module for MS Word 5, experimental, unsupported, buggy, beware!
Speech Rhythms      AOL                 A cool text file for one of the above apps
_____
* Sources:
  + AOL = America Online
  + info-mac = {ftp sumex-aim.stanford.edu, ftp wuarchive.wustl.edu, et al.}
  + MacWarehouse = (800) 255-6227
* Misc: Apple's work in spoken language technologies and systems is described in:
  + Lee, Kai-Fu. "The Conversational Computer: An Apple Perspective." (Keynote Speech) In Proc. Eurospeech in Berlin, September, 1993.

MacinTalk
* Platform: Macintosh
* Cost: Free
* Description: Formant based speech synthesis. There is also a program called "tex-edit" which apparently can pronounce English sentences reasonably using Macintalk.
* Note: MacinTalk doesn't run reliably on Macintoshes with new sound hardware under the latest OS (System 7.1 w/HUD 2.0). More recent software is listed above.
* Availability: By anonymous ftp from many archive sites (have a look on archie if you can). tex-edit is on many of the same sites. Try
  ftp://wuarchive.wustl.edu/mirrors2/info-mac/Old/card/macintalk.hqx
  ftp://wuarchive.wustl.edu/mirrors2/info-mac/Old/card/macintalk-stack.hqx
  ftp://wuarchive.wustl.edu/mirrors2/info-mac/app/tex-edit-15.hqx

Monologue for Windows from First Byte
* Description: Monologue, a software program that reads text from the clipboard in Windows 16 or 32 bit applications, can be found as a bundled product with many sound cards and multimedia general purpose computer systems. It is not offered as a separate product at this time. Monologue can add the element of speech to virtually any text oriented application. Any pronounceable combination of letters and numbers will be spoken clearly. It can be applied to tasks such as eyes-free proofreading, data verification (e.g. spreadsheets), reading E-mail and more. User-changeable parameters provide control over the sound quality by allowing for changes in pitch and the speed of speech. An exception dictionary saves preferred pronunciations of words and abbreviations. Monologue works with sound devices that comply with the Windows Sound API. Monologue male "SpeechFonts" are available for US English, British English, German, French, Latin American Spanish and Italian. A US English Female SpeechFont is also available.
* Availability: Currently bundled with many sound cards and multimedia general purpose computer systems. Monologue will soon be available as a stand-alone product. Single user and site licenses as well as Distributor discounts will be offered.
* WWW: For more detailed information and examples go to the First Byte WWW page: http://www.firstbyte.davd.com/
* See also: ProVoice Developer's Speech Toolkit from First Byte
* Contact: First Byte
  19840 Pioneer Ave., Torrance, CA 90503
  Ph: 310-793-0610
  Fax: 310-793-0611
  Email: info@firstbyte.davd.com
  WWW: http://www.firstbyte.davd.com/

Narrator Translator Library
* Platform: Amiga
* Description: A replacement for the Commodore-supplied "translator.library" which is a part of the Narrator speech synthesis package. It implements multi-lingual text-to-speech for an Amiga. The library allows the user to specify the language the text to be spoken should be translated as. This can be done by setting the default language or by including markup codes in the text in a similar way to LaTeX or HTML, e.g. "\french{Bonjour}". There is currently support for American English, British English, Swedish, Maori, Finnish, German, Icelandic, Klingon, Polish, Italian, and Welsh.
* Availability: The library (but not source) is available by anonymous ftp from Aminet: ftp://ftp.doc.ic.ac.uk/pub/aminet/util/libs/translator42.lha
* More Information: Available on the WWW: http://www.sans.vuw.ac.nz/~ffranc/translator/index.html

Narrator
* Platform: Amiga
* Description: Formant based speech synthesis. Includes an English-to-phoneme translation library, and a SPEAK: pseudo-device for speech output.
* Hardware: Standard Amiga hardware
* Availability: Part of AmigaOS
* See Also: The Narrator Translator Library

TextToSpeech Kit
* Platform: NeXT Computers
* Description: The TextToSpeech Kit does unrestricted conversion of English text to synthesized speech in real-time. The user has control over speaking rate, median pitch, stereo balance, volume, and intonation type. Text of any length can be spoken, and messages can be queued up, from multiple applications if desired. Real-time controls such as pause, continue, and erase are included. Pronunciations are derived primarily by dictionary look-up.
The Main Dictionary has nearly 100,000 hand-edited pronunciations which can be supplemented or overridden with the User and Application dictionaries. A number parser handles numbers in any form. A letter-to-sound knowledge base provides pronunciations for words not in the Main or customized dictionaries. Dictionary search order is under user control. Special modes of text input are available for spelling and emphasis of words or phrases. The actual conversion of text to speech is done by the TextToSpeech Server. The Server runs as an independent task in the background, and can handle up to 50 client connections.
* Misc: The TextToSpeech Kit comes in two packages: the Developer Kit and the User Kit. The Developer Kit enables developers to build and test applications which incorporate text-to-speech. It includes the TextToSpeech Server, the TextToSpeech Object, the pronunciation editor PrEditor, several example applications, phonetic fonts, example source code, and developer documentation. The User Kit provides support for applications which incorporate text-to-speech. It is a subset of the Developer Kit.
* Hardware: Uses standard NeXT Computer hardware.
* Cost:
  + TextToSpeech User Kit: $175 CDN ($145 US)
  + TextToSpeech Developer Kit: $350 CDN ($290 US)
  + Upgrade from User to Developer Kit: $175 CDN ($145 US)
* Availability: Trillium Sound Research
  1500, 112 - 4th Ave. S.W., Calgary, Alberta, Canada, T2P 0H3
  Tel: (403) 284-9278
  Fax: (403) 282-6778
  Order Desk: 1-800-L-ORATOR (US and Canada only)
  Email: TTSInfo@trillium.ab.ca

Orator Text-to-Speech Synthesizer
* Platform: SUN SPARC, Decstation 5000. Written in C, and therefore portable to other UNIX platforms. Some successful ports: HP, RS-6000, PC-Unix [Linux].
* Description: Sophisticated speech synthesis package. Has text preprocessing (for abbreviations, numbers), acronym rules, and human-like spelling routines. Natural-sounding synthesis based on demisyllable concatenation.
Has high accuracy for pronunciation of names of people, places and businesses in America; good accuracy for English text; rules for stress and intonation marking; various methods of user control and customization at most stages of processing. A new version of the ORATOR system is under development. Both ORATOR and this new "ORATOR II" system are capable of general text synthesis. The ORATOR II system has a more natural-sounding voice. * Hardware: Runs on common SPARC or Decstation workstations, using their internal audio output capability. Recommend at least 16M of memory. * WWW: More detailed information plus examples of ORATOR synthesis are available on the ORATOR WWW pages: http://www.bellcore.com/demotoo/ORATOR/index.html * Misc 1: A free demo cassette is available. * Misc 2: Examples of Orator are also available on the University of Birmingham Speech Synthesis "Museum" WWW site (see Q5.4). * Availability and Pricing: Contact Bellcore's Licensing Office Tel: 1-800-521-CORE (521-2673) Fax: 1-908-336-2559 Email: Anthony Lindsey: alin1@panix.com WWW: http://www.bellcore.com/demotoo/ORATOR/index.html PAM - A Text-To-Speech Application * Platform: Windows * Description: PAM is a talking personal assistant and text reader application. It uses the ProVoice TTS package. PAM will verbally advise about appointments and reminder messages at specified times during the day. It can read text files, clipboard text, and text sent in DDE messages. Using the full verbal interface, PAM can be used by visually challenged individuals. Shareware - thirty day free trial. * Requirements: Any Windows sound card, speakers or headphones. Min. memory - 4 megs, 8 megs recommended. * WWW: A more complete description is available on the JTS homepage: http://www.islandnet.com/~tslemko/homepage.html * Availability: The shareware can be downloaded by ftp from ftp://ftp.islandnet.com/jts/pam_en1e.zip. The file size is approx. 1 MByte. * Price: $US40 for the registered version. 
* Contact: Tom Slemko: e-mail: tslemko@islandnet.com, or, JTS Micro Consulting Ltd 10931 Lytton Road, RR#4, Ladysmith, B.C., Canada, V0R 2E0 ProVerbe Speech Engine for Windows (95 and NT) * Description: The ProVerbe Speech Engine produces natural sounding speech from any written text. A high level of naturalness is achieved by using the TD-PSOLA process from the CNET (France Telecom's research lab) which is based on the concatenation of elementary speech units (including diphones). Supported languages are British English, German, French and Spanish. For multi-channel applications Elan Informatique also provides hardware platforms. * Demo: Anonymous ftp from ftp://www.cict.fr/pub/elan/ * Contact: Elan Informatique 4 rue Jean Rodier, 31400 TOULOUSE FRANCE Contact person: Pierre Delrat Phone: +33 61 36 07 77 Fax: +33 61 36 07 70 BBS: +33 61 36 07 88 E-mail: 101346,465@compuserve.com ProVoice Developer's Speech Toolkit from First Byte * Platform: ProVoice Developer's Toolkits are available for DOS, Windows 3.1, Windows 95, Windows NT, OS/2, and Macintosh. * Description: ProVoice allows programmers to add synthesized speech to their applications. Your program passes text strings to the ProVoice speech engine that translates text into audible speech. Male and/or female "SpeechFonts" are available for many languages: English, French, German, UK British English, Italian, and Spanish. ProVoice converts text to speech in two phases using a set of phonetic translation and pronunciation rules. First, the software analyzes and translates text into "sound descriptors", a phonetic language with pitch, duration, and amplitude codes which are needed to produce stress patterns in phrases and sentences. Rules are used to analyze words, numbers, and punctuation. The second phase converts the intermediate phonetic language into speech signals; algorithms drive distinct speech signals into smooth flowing, continuous, clear speech.
Real time synchronization of mouth movement and word boundaries allows animation of a graphical talking character, or highlighting of displayed text as it is spoken. Necessary tools and examples are provided for programmers to manipulate the ProVoice speech technology; including installation instructions, extensive sample programs, and complete documentation. In addition, sample code is provided on disk to illustrate speech programming techniques. * Note 1: First Byte will perform custom work for embedded systems. * Note 2: ProVoice Windows will speak through any Windows-supported wave audio device. * Note 3: Distribution of ProVoice for commercial use is subject to execution of a Commercial Product Distribution License Agreement. * WWW: For more detailed information and examples go to the First Byte WWW page: http://www.firstbyte.davd.com/ * See also: Monologue for Windows from First Byte * Price and Availability: Contact First Byte * Contact: First Byte 19840 Pioneer Ave., Torrance, CA 90503 Ph: 310-793-0610 Fax: 310-793-0611 Email: info@firstbyte.davd.com WWW: http://www.firstbyte.davd.com/ RC Systems V8600/V8601 Text to Speech synthesizers * Platform 1: IBM PC: ISA card. * Platform 2: Interface to PC/104 standard microcontrollers. * Platform 3: Standalone (or embedded) thru RS232 or parallel printer port or processor bus. * Description: Converts plain ASCII text to speech. Programmable voices, pitch rate, volume, etc. Built-in DTMF and tone generators. * Price: $151-$299 US (qty 1) * Contact: RC Systems 1609 England Avenue, Everett, WA 98203, USA Ph: (206) 355-3800 Fax: (206) 355-1098 Europe: +44181 539-0285 rsynth * Platform: Various (including Solaris2.3, SunOS4.1.3, HPUX, SGI Irix4.x, Linux) * Description: Public domain text-to-speech system assembled from a variety of sources. It supports CMU and BEEP format dictionaries (as described in Q1.10) and now utilises stress marks in the dictionary in synthesising intonation.
* Price: Free * Misc: Axel Belinfante has implemented a WWW rsynth demo: http://wwwtios.cs.utwente.nl/say. * Availability: by anonymous ftp from ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/synthesis/rsynth-2.0.tar.Z ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/synthesis/rsynth-2.0.tar.gz SENSYN speech synthesizer * Platform: PC, Mac, Sun, and NeXT * Rough Cost: $300 * Description: This formant synthesizer produces speech waveform files based on the (Klatt) KLSYN88 synthesizer. It is intended for laboratory and research use. Note that this is NOT a text-to-speech synthesizer, but creates speech sounds based upon a large number of input variables (formant frequencies, bandwidths, glottal pulse characteristics, etc.) and would be used as part of a TTS system. Includes full source code. * Availability: Sensimetrics Corporation 64 Sidney Street, Cambridge MA 02139. Fax: (617) 225-0470; Tel: (617) 225-2442. Email: sensimetrics@sens.com SGI Developers Toolbox Synthesiser * Platform: SGI * Description: The SGI Developer Toolbox 4.0 CDROM contains a basic public domain text-to-speech program in the publics/speak directory. The directory includes man pages and source. * Availability: on the SGI Developer Toolbox 4.0 CDROM SIMTEL A wide range of speech related software, sound-blaster software and signal processing software for PCs is available on SimTel and its mirror sites. It can be obtained by ftp from: ftp://oak.oakland.edu/SimTel/msdos/voice/ and is now on the WWW: http://www.acs.oakland.edu/oak/SimTel/win3/sound.html Voicemaker The archives include the program Voicemaker which synthesises speech from phonemes using "concatenation" of phonemes recorded by the user. Voicemaker is a freeware program. It requires an IBM or compatible, 512KB RAM, sound blaster compatible sound card.
ftp://oak.oakland.edu/SimTel/msdos/voice/vm110.zip Sound Bytes Developer's Kit * Platform: Subroutine library for PC (MS-Windows, OS/2) and Macintosh * Hardware: Windows - 16 MHz 80386 (minimum) running Windows 3.1; 4 Mb RAM with at least 1.4 Mb RAM free. Disk space 1.4 Mb. OS/2 - 16 MHz 80386 (minimum) running OS/2 2.0 or above; 8 Mb RAM with at least 1.4 Mb RAM free. Mac - Any Mac with at least 2.5 Mb of RAM running 6.0.4 or higher. Telephone compatible. Compatible with commonly used sound cards. * Description: SBDK is a software-only sentence-level synthesizer that converts unrestricted English text (ASCII) into synthesized voice through diphone concatenation. SBDK utilizes parsing to incorporate the intonational and rhythmic patterns of normal speech. The developer's kit includes two voices, one female and one male. The product has a 55,000-word built-in dictionary and a tool for creating customized user dictionaries. It converts numbers, dates, dollars, phone numbers and times to words, and has a SoundOut facility that provides a choice of pronouncing unknown words phonetically or spelling them out. Developers can vary voice pitch (130-220 Hz) and rate (65-200 wpm), synchronize speech to other events, have multiple channels of speech to the same or different boards, etc. Speech sampling options: 8-bit linear; 8-bit companded at 11 kHz (Windows); 8-bit mu-law PCM at 8 or 11 kHz; 16-bit linear at 11 kHz. * Cost: Sound Bytes may be licensed for internal use or resale. Site license fee= $3750. Resale or Internal runtime fees= 2% of net sales price per runtime sold, OR $150 per telephone port, OR per unit pricing for internal use determined case-by-case. * Misc: Demo disks are available for Windows and the Mac. * Availability: Natural Speech Technologies, Inc. - (619) 457-2526. spchsyn.exe * Platform: PC? * Availability: By anonymous ftp as a self extracting DOS archive. ftp://evans.ee.adfa.oz.au/mirrors/tibbs/applications/spchsyn.exe
* Requirements: May require special TI product(s), but all source is there. "Speak" - a Text to Speech Program * Platform: Sun SPARC * Description: Text to speech program based on concatenation of pre-recorded speech segments. A function library can be used to integrate speech output into other code. * Hardware: SPARC audio I/O * Availability: by anonymous ftp ftp://wilma.cs.brown.edu/pub/speak.tar.Z Speech Manager and PlainTalk * Platform: Macintosh * Cost: Free * Description: Apple's text-to-speech system extensions that enable applications to perform text-to-speech conversion. The Speech Manager runs on most Macs, but PlainTalk (and the high quality voices) requires a 68020 Mac or better. * Availability: By anonymous ftp from: ftp://ftp.support.apple.com/pub/apple_sw_updates/US/Macintosh/system_sw/PlainTalk 1.4.1 This directory contains subdirectories for recent versions of PlainTalk. The current release (PlainTalk 1.4.1) contains the English Text-To-Speech with about a dozen voices (English_Text-to-Speech.hqx: 5.3 MByte), Mexican Spanish (Mexican_Spanish_TTS.hqx: 2.8 MByte), and the English Speech Recognition software (English_Speech_Recognition.hqx: 2.3 MByte). * WWW: The latest information is available from Apple's WWW page for speech recognition and synthesis: http://www.info.apple.com/apple.speech/ * Note: Joshua Baer (shaddar+@cmu.edu) runs a mailing list for PlainTalk. To subscribe, send email to plaintalk@thelorax.mac.cc.cmu.edu with the word subscribe as the subject. There is also a WWW page with links to ftpable software. http://www.contrib.andrew.cmu.edu/usr/jbbt/plaintalk/plaintalk.html Text to phoneme program (1) * Platform: unknown * Description: Text to phoneme program. Based on Naval Research Lab's set of text to phoneme rules. * Availability: by anonymous ftp ftp://shark.cse.fau.edu/pub/src/phon.tar.Z Text to phoneme program (2) * Platform: unknown * Description: Text to phoneme program.
* Availability: by anonymous ftp ftp://wuarchive.wustl.edu/mirrors/unix-c/utils/phoneme.c Text to phoneme program (3) * Description: A public domain version of the same Naval Research Lab text to phoneme rules. * Availability: By anonymous ftp ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/synthesis/english2phoneme.shar Tinytalk * Platform: PC * Description: Shareware package is a speech 'screen reader' which is used by many blind users. * Price: Tinytalk is now $150. There are package deals on Tinytalk with various speech synthesizers. * Availability: Tinytalk is available by anonymous ftp from the following site Files: ttdoc167.zip and ttdoc167.zip (executable and documentation) ftp://ftp.netcom.com/pub/eb/ebohlman/ (Note: it is a busy ftp server.) * Contact: Eric Bohlman OMS Development 610-B Forest Ave., Wilmette, IL 60091 Ph: (800)831-0272 Fax: 708-251-5793 Outside North America: (708)-251-5787 Email: ebohlman@netcom.com TrueTalk * Platform: Sun Sparcstation 1+/2/LX/5/10/20 with SunOS 4.1.3, or SGI Indy/Indigo/Indigo2 with IRIX 5.2. Other platforms in development. * Description: Personal TrueTalk, by Entropic Research Laboratory, Inc., is an all-software Text-to-Speech (TTS) system designed to voice-enable UNIX X-Windows workstations. It combines a graphical interface with a powerful TTS engine based on technology developed by AT&T Bell Laboratories. Features include: + Intelligible, prosodically natural speech. + Text taken from file input, highlighted X selections, the interface scratch pad, other programs connected through a TCP/IP socket, or Tcl/Tk applications via the Tk "send" mechanism. + Stop, pause and resume while speech is in progress. + Visual indication of corresponding text position when paused. + Nine speaking voices, with Male and Female versions of each voice. + Adjustable speaking rate and volume. + Supports drop-in text filters; "email" and "lively" examples included. + Audio output through workstation headphones or speaker.
+ Complete on-line documentation, including mouse-activated help windows. * Misc: A more detailed description of TrueTalk is available on the Entropic WWW server: http://www.entropic.com/truetalk.com * Availability: You can obtain Personal TrueTalk through the Internet. For details, see ftp://ftp.entropic.com/pub/truetalk/README.ptt Personal TrueTalk is available free of charge for evaluation purposes. You can fully enable your evaluation copy at any time by purchasing a license key from Entropic. * Requirements: 12MB disk space, 8MB process size (24MB system RAM recommended). * Cost: US$495; US$395 academic * Contact: Entropic Research Laboratory, Inc., Washington, D.C. Voice: 1-800-ENTROPIC (North America), (202) 547 1420 Fax: (202) 547-6648 Email: truetalk@entropic.com WWW: http://www.entropic.com/ TruVoice from Centigram * Platform: Windows-NT, Windows 95, Windows 3.1 (limited release), OS/2, Sun Solaris 1&2 * Description: TruVoice, an advanced text-to-speech converter, is available for multiple environments. TruVoice converts text into spoken language. TruVoice adds intelligible, natural-sounding speech to sound enabled platforms. + No vocabulary restrictions + User-definable pronunciation dictionary + Accurately pronounces surnames and place names + Preprocessor provides e-mail and spreadsheet reading capabilities and expands abbreviations. + Multiple languages available: American English, Latin American Spanish, German, French, Italian + Flexible pitch, volume and speech rate + Intonation support for punctuation + Supports navigational capabilities such as pause, resume and jump forward / jump back More detailed information is provided in the brochure page on the Centigram WWW pages. A demonstration of TruVoice is available on the Centigram WWW pages. * Cost: + Windows versions are $295 for the SDK + Solaris versions are $995 + Contact Centigram for other pricing.
* Contact: Christine Hansen Centigram Communications Corporation 91 East Tasman Drive, San Jose, CA 95134 Tel: 408/944-0250 Fax: 408/428-3732 Email: chris.hansen@centigram.com WWW: http://www.centigram.com/ ___________________________________________________________________________ FAQ SECTION 6 - SPEECH RECOGNITION * Q6.1: What is speech recognition? * Q6.2: How is speech recognition performed? * Q6.3: How can I build a simple speech recogniser? * Q6.4: References & books on speech recognition * Q6.5: Speech Recognition Hardware/Software Q6.1: WHAT IS SPEECH RECOGNITION? Automatic Speech Recognition Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech. What does speaker dependent / adaptive / independent mean? A speaker dependent system is developed to operate for a single speaker. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker adaptive or speaker independent systems. A speaker independent system is developed to operate for any speaker of a particular type (e.g. American English). These systems are the most difficult to develop, most expensive and accuracy is lower than speaker dependent systems. However, they are more flexible. A speaker adaptive system is developed to adapt its operation to the characteristics of new speakers. Its difficulty lies somewhere between speaker independent and speaker dependent systems. What does small/medium/large/very-large vocabulary mean? The size of vocabulary of a speech recognition system affects the complexity, processing requirements and the accuracy of the system. Some applications only require a few words (e.g. numbers only), others require very large dictionaries (e.g. dictation machines).
There are no established definitions; however, try * small vocabulary - tens of words * medium vocabulary - hundreds of words * large vocabulary - thousands of words * very-large vocabulary - tens of thousands of words. What does continuous speech or isolated-word mean? An isolated-word system operates on single words at a time - requiring a pause between saying each word. This is the simplest form of recognition to perform because the end points are easier to find and the pronunciation of a word tends not to affect others. Thus, because the occurrences of words are more consistent they are easier to recognise. A continuous speech system operates on speech in which words are connected together, i.e. not separated by pauses. Continuous speech is more difficult to handle because of a variety of effects. First, it is difficult to find the start and end points of words. Another problem is "coarticulation". The production of each phoneme is affected by the production of surrounding phonemes, and similarly the start and end of words are affected by the preceding and following words. The recognition of continuous speech is also affected by the rate of speech (fast speech tends to be harder). Q6.2: HOW IS SPEECH RECOGNITION PERFORMED? A wide variety of techniques are used to perform speech recognition. There are many types of speech recognition. There are many levels of speech recognition / analysis / understanding. Typically speech recognition starts with the digital sampling of speech. The next stage is acoustic signal processing. Most techniques include spectral analysis; e.g. LPC analysis, MFCC, cochlea modelling and many, many more. The next stage is recognition of phonemes, groups of phonemes and words. This stage can be achieved by many processes such as DTW (Dynamic Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), expert systems and combinations of techniques. HMM-based systems are currently the most commonly used and most successful approach.
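As an illustration of the pattern-matching stage, here is a minimal sketch of isolated-word recognition by DTW template matching. It is not taken from any system described in this FAQ: the function names (dtw_distance, recognise) and the toy two-dimensional feature vectors are assumptions made for the example, and a real recogniser would compare frames of spectral features (e.g. LPC or MFCC coefficients) instead.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            # A frame may be matched, or either sequence may be stretched.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def recognise(unknown, templates):
    """Return the label of the stored template closest to the unknown utterance."""
    return min(templates, key=lambda label: dtw_distance(unknown, templates[label]))

# Hypothetical toy data: each word is a short sequence of 2-D feature frames.
templates = {
    "yes": [(1.0, 0.0), (2.0, 0.5), (3.0, 1.0)],
    "no": [(3.0, 3.0), (2.0, 2.0), (1.0, 3.0)],
}
# The same contour as "yes", spoken more slowly (one extra frame).
unknown = [(1.0, 0.0), (1.1, 0.1), (2.0, 0.5), (2.9, 1.0)]
print(recognise(unknown, templates))
```

The warping step is what lets a slowly spoken word match a template of a different length; an HMM-based system replaces the stored templates with statistical models but addresses the same time-alignment problem.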
Most systems utilise some knowledge of the language to aid the recognition process. Some systems try to "understand" speech. That is, they try to convert the words into a representation of what the speaker intended to mean or achieve by what they said. Q6.3: HOW CAN I BUILD A SIMPLE SPEECH RECOGNISER? Doug Danforth provides a detailed account in article 253 in the comp.speech archives. A summary is provided below. It is also available by anonymous ftp ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/info/DIY_SpeechRecognition QUICKY RECOGNIZER sketch: Here is a simple recognizer that should give you 85%+ recognition accuracy. The accuracy is a function of the words you have in your vocabulary. Long distinct words are easy. Short similar words are hard. You can get 98+% on the digits with this recognizer. Overview: * Find the beginning and end of the utterance. * Filter the raw signal into frequency bands. * Cut the utterance into a fixed number of segments. * Average data for each band in each segment. * Store this pattern with its name. * Collect training set of about 3 repetitions of each pattern (word). * Recognize unknown by comparing its pattern against all patterns in the training set and returning the name of the pattern closest to the unknown. Many variations upon the theme can be made to improve the performance. Try different filtering of the raw signal and different processing methods. Q6.5 contains information on public domain speech recognition software: Lotec and Myers' Hidden Markov Model software. Q6.4: REFERENCES & BOOKS ON SPEECH RECOGNITION PRODUCT REVIEWS AND COMPARISONS Comparisons of speech recognition products (this article is already a year out of date). * "Talk Show", Wayne Rash Jr., PC Magazine (USA), Dec 20, 1994. * "Seybold Report on Desktop Publishing" published a nine-page, head-to-head comparison of Dragon's DOS software with IBM's OS/2 software. March 7, 1994; Volume 8, Number 7; Pages 3-11; ISSN:0889-9762; Seybold Publications, P.O.
Box 644, Media, PA 19063 USA, phone (610) 565-2480. * McGraw-Hill Inc.'s "BYTE, the Magazine of Technology Integration," published a two-page review of IBM's Personal Dictation System software. May 1994; Volume ?, Number ?; Pages 145-146; ISSN:0360-5280; Editorial, Executive, and Circulation address: One Phoenix Mill Lane, Peterborough, NH 03458 USA, phone ? TECHNOLOGY: GENERAL AND INTRODUCTORY Some general introductory books on speech recognition technology: * Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang Juang Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993, ISBN 0-13-015157-2 * Speech recognition by machine; W.A. Ainsworth London: Peregrinus for the Institution of Electrical Engineers, c1988 * Speech synthesis and recognition; J.N. Holmes Wokingham: Van Nostrand Reinhold, c1988 * Speech Communication: Human and Machine, Douglas O'Shaughnessy; Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987. * Electronic speech recognition: techniques, technology and applications, edited by Geoff Bristow, London: Collins, 1986 * Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu Lee. San Mateo: Morgan Kaufmann, c1990 TECHNICAL * Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki, M.A. Jack. Edinburgh: Edinburgh University Press, c1990 * Speech Recognition: The Complete Practical Reference Guide; T. Schalk, P. J. Foster: Telecom Library Inc, New York; ISBN O-9366648-39-2; 377 pages; paperback only. Covers speech recognition in a telephony environment, for readers who wish to use call processing hardware based in PCs. It is written using Dialogic hardware as the example for the hardware. * Automatic speech recognition: the development of the SPHINX system; by Kai-Fu Lee; Boston; London: Kluwer Academic, c1989 * An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition, S. E. Levinson, L. R. Rabiner and M. M.
Sondhi; in Bell Syst. Tech. Jnl. v62(4), pp1035--1074, April 1983 * Review of Neural Networks for Speech Recognition, R. P. Lippmann; in Neural Computation, v1(1), pp 1-38, 1989. BIBLIOGRAPHY The following book is a comprehensive bibliography of speech processing. * Computational Speech Processing: Speech Analysis, Recognition, Understanding, Compression, Transmission, Coding, Synthesis ; Text to Speech Systems, Speech to Tactile Displays, Speaker Identification, Prosody Processing : BIBLIOGRAPHY, by Conrad F. Sabourin, 1994, 2 volumes, 1187p, ISBN 2-921173-21-2, INFOLINGUA inc., P.O. Box 187 Snowdon, Montreal, H3X 3T4, Canada. Q6.5: SPEECH RECOGNITION HARDWARE & SOFTWARE The number of speech recognition packages, and the information about the software is changing rapidly. Any help with keeping this information up to date will be appreciated. Speech Recognition Processors (ICs) Jean-Pierre Lereboullet has put together a detailed list of Voice Recognition Processors which covers about 15 ICs and pieces of related hardware (including D6106, HM2007, MSM6679, RSC-164, TC8860F/64F/65F, 5A128). The document is available on the comp.speech ftp server: ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/info/VoiceRecognitionProcessors Recognition Information on the WWW In addition to the entries on speech recognition in this FAQ, the following WWW sites provide information on speech recognition: Commercial Speech Recognition: Russ Wilcox of PureSpeech Inc. http://www.tiac.net/users/rwilcox/speech.html Yahoo pages on Speech Recognition http://www.yahoo.com/business/corporations/computers/software/voice_recognition/ http://www.yahoo.com/Science/Computer_Science/Artificial_Intelligence/Natural_Language_Processing/Speech_Recognition/ IN THE FAQ... The following speech recognition software/hardware is described in the comp.speech FAQ.
* AbbotDemo * BBN Hark Telephony Recognizer * Corona Speech Recognition System * Custom Voice(TM) by A&G Graphics Interface * D6006 Voice Control Processor * DATAVOX - French * Digital Dreams Speech Recognition Plug-Ins * DragonDictate version 3.0 * DragonDictate for Windows * DragonVoiceTools * DSP Semiconductor Recognition Chip * EARS: Single Word Recognition Package * HM2007 - Speech Recognition Chip * Hidden Markov Model Toolkit (HTK) from Entropic * IBM VoiceType Dictation * ICSS system from IBM * IN3 Voice Command * IN3 Voice Command for Windows * Kurzweil Voice for Windows * Lernout & Hauspie ASR (3 products) * Lernout & Hauspie ASR SDK * Listen for Windows 2.0 - Verbex Voice Systems * Lotec Speech Recognition Package * Myers' Hidden Markov Model software * NCC Dictate * OKI VRP6679 - Speech Recognition Chip * Speech Systems Phonetic Engine 500 (PE500) * PowerSecretary * ProNotes Voice Tools (due late '95) * PureSpeech * recnet * SayIt * Simon Says - for NeXT * Speech Commander - Verbex Voice Systems * 'Speech Recognition Expert' Toolkit for Windows * Visual Voice from Stylus Innovation * Voice Command Line Interface * Voice Control Systems Recognition * Visus SpeechKit * VCS 2030 & 2060 Voice Dialer * Voice-Trek 2.0 * Creative VoiceAssist * Voice Blaster Ver. 4.0 * VoiceServer for Windows * Votan * Voice Processing Corporation Speech Recognition Product Line AbbotDemo * Platform: SunOS4, IRIX, Linux, HP-UX * Description: Large vocabulary, speaker independent, continuous automatic speech recognition system. Uses recurrent neural networks and hidden Markov models with a 5,000 word vocabulary (upgradable) and a trigram word grammar. Includes a front end for waveform capture and display (including spectrogram) and a graphical display of the phoneme representation as well as a rewriting display of the best guess word sequence.
* Requirements: UN*X, X, 8 Mbyte free RAM, 486DX or faster processor, 16 bit soundcard, reasonable quality microphone and a copy of the Wall Street Journal newspaper. * Price: Free for non-commercial use * Availability: By anonymous ftp from ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/recognition/AbbotDemo * Note 1: This is not a complete system for dictation. * Note 2: At present there are no sources with this distribution. For sources for an earlier version see the recnet entry. * Note 3: Not supported. * Contact: AbbotDemo@compute.demon.co.uk Tony Robinson Cambridge University Engineering Department Trumpington Street, Cambridge, CB2 1PZ, UK Tel: +44-1223-332815 Fax: +44-1223-332662 BBN Hark Telephony Recognizer * Platform: Available for Unix-based workstation and PC hardware platforms including IBM RS6000/AIX and Pentium/SCO Unix. * Description: Large vocabulary (2,000+ words), speaker independent, continuous ASR software. Specifically designed for large scale telephony applications. Using a client/server architecture, all features and capabilities are integrated in one software product instead of on separate boards. Very memory efficient, the Hark Telephony Recognizer runs in as little as 2MB of physical memory. Multiple recognizers can be run on a single platform. Uses Hidden Markov Model and phoneme-based BBN recognition algorithms. An API is provided for integration with existing applications. A developer's toolkit is available. * Price and availability: Price varies depending on vocabulary size. Version 3.0 available immediately. * Misc: BBN Hark provides application design and human factors consulting services. Regular monthly training classes on developing speech-enabled applications are held at BBN Hark's Cambridge (Mass) headquarters. * WWW: For additional information, see BBN Hark's home page on the Web at http://www.bbn.com/bbn_hark/HarkHome.html. 
* Contact: BBN Hark Systems 70 Fawcett Street, Cambridge, MA 02138 Tel: 617-873-4636 Fax: 617-873-2473 WWW: http://www.bbn.com/bbn_hark/HarkHome.html Corona Speech Recognition System * Platform: Unknown * Description: The Corona System is a UNIX-based, multi-channel recognition system designed for telephony-based applications. It features speaker-independent, continuous speech recognition over standard telephone lines and includes a natural language understanding capability. The natural language capability significantly enhances throughput performance of the application and makes life easier for the application developer. * Price and availability: Unknown * Contact: Corona Corp. Menlo Park, CA Tel: (415) 462 8200 Fax: (415) 462 8201 Custom Voice(TM) by A&G Graphics Interface * Description: Speech recognition custom control for Visual Basic, Visual C++, Borland C++, and other development platforms that support *.VBX. Provides an engine/proprietary independent development platform for speech recognition. Currently supports ICSS, but should soon support other platforms. Includes a grammar debugger and parser APIs to parse spoken speech into useful data types. * Requirements: Visual Basic or any development platform that supports VBX. * Price: $US495 or $695 bundled with ICSS. * Contact: A&G Graphics Interface 51 Gore Street, Cambridge, MA, 02139, USA (617) 492-0120 D6006 Voice Control Processor * Misc: Is this chip from the same manufacturer as the D6106 which is described in Jean-Pierre Lereboullet's document on Voice Recognition Processors? * Contact: DSP Telecommunications Inc. 2855 Kifer Road, Suite 202, Santa Clara CA 95051, USA Tel:(408)986-4310 Fax:(408)986-4324 DATAVOX - French * Platform: PC * Description: Continuous speech - speaker independent or dependent. * Rough Cost: ? 
* Requirements: 2 PC format boards (RdF1000 and TdS 96/25) and an A/D - D/A module (ASA116) * Misc: Application software may dialog with DATAVOX through 2 types of interfaces : + Keyboard overlay: The application software may be used with any PC compatible package. No specific adaptation is necessary, you only need to define your configuration with the application software. + C library: Allows a user-written program to drive the recognition system. DATAVOX is based on the AMADEUS speech recognition software developed at LIMSI. It provides + Continuous speech recognition with 500 words speaker dependent, 50 words speaker independent (custom-made vocabulary). + Grammar of the application language (syntax acquisition, verification and simplification software). + Large vocabulary : DATAVOX can recognize vocabularies of several thousand words as long as there are no more than 500 words in the active vocabulary at any given node. It takes less than 1 second to change syntax and vocabulary. + Training controlled by the system (use of co-articulation models). + Response time less than 500 ms for any phrase length. + Synthesis (ADPCM) can be heard simultaneously while recognition is being carried out. * Contact: VECSYS Le Chene rond, 91570 Bievres, France Fax: 33 1 69 41 24 30 Voice: 33 1 69 41 15 04 Digital Dreams Speech Recognition Plug-Ins * Platform: Apple Quadra AV or Power Macintosh * Description: A suite of speech plug-ins for the interactive multimedia market which enable developers to quickly incorporate speech recognition into their titles without having to resort to a low-level programming language, such as C. Speech plug-ins bridge the gap between a speech recognition API, such as Apple's PlainTalk Speech Recognition technology, and authoring/development environments, such as Macromedia Director or HyperCard. Digital Dreams currently offers Macintosh speech plug-ins for Macromedia Director and HyperCard.
Support for other environments, including AppleScript, Apple Media Tool, Authorware, and Windows is being developed. Currently available for North American Adult English. More information is available on the Digital Dreams WWW site. * Requirements: Apple's PlainTalk Speech Recognition extension. * Cost: Single User License $200 * Contact: Digital Dreams 4308 Harbord Drive, Oakland, CA, 94618, USA Tel: (510) 547-6929 Fax: (510) 547-6799 email: dreams@emf.net WWW: http://www.emf.net/~dreams/ FTP: ftp://emf.net/users/dreams DragonDictate version 3.0 * Platform: PC / DOS * Description: Speaker-adaptive recognition system for discrete speech. Provides 110,000 word dictionary and also allows user to add words. Active vocabulary of 5,000, 30,000, or 60,000 words. Allows dictation into almost all DOS applications (word processors, spreadsheets, etc.) and hands-free operation of the PC. Specialized medical and legal vocabularies are available as add-on products. More information on the Dragon Systems WWW pages. * Cost: Prices including audio board and high-quality headset microphone: + 5,000 word Starter Edition: US$695 + 30,000 word Classic Edition: US$995 + 60,000 word Power Edition: US$1,995 + Medical vocabulary add-on: US$495 + Legal vocabulary add-on: US$495 * See also: DragonDictate for Windows and DragonVoiceTools. * Requirements: Minimum of 33 Mhz 486 with 8-16M memory and at least 29M disk space (depending on product), one 8-bit slot, DOS 5.0 and up (also runs in a DOS box under Windows or OS/2). * Contact: Dragon Systems, Inc. 320 Nevada Street, Newton, MA 02160, USA Tel: 1-617-965-5200 or 1-800-TALK-TYP Fax: 1-617-527-0372 Email: info@dragonsys.com WWW: http://www.dragonsys.com/ CompuServe: GO DRAGON Note: Simon Crosby maintains an FAQ for DragonDictate: http://www.cl.cam.ac.uk/users/sac/dd-faq.html DragonDictate for Windows * Platform: PC * Description: Speech-to-text dictation system. Discrete dictation; continuous command/control; speaker-adaptive. 
Also provides mouse movement for hands-free operation of Windows. Comes with a 120,000 word pronunciation dictionary; users can also add their own words or phrases. Dictate directly into any application. Available in US and UK English, French, Italian, German, Spanish, and Swedish. More information on the Dragon Systems WWW pages. * Requirements: 486/66, 7-10 MB dedicated RAM (depending on edition), Windows 3.1x or 95. Supported sound boards: Creative Labs Sound Blaster 16, Microsoft Windows Sound System, IBM M-Audio Capture/Playback Adapter. * Rough Cost: Prices including software, documentation and microphone: + DragonDictate Personal Edition (10,000 words active) - $395 + DragonDictate Classic Edition (30,000 words active) - $695 + DragonDictate Power Edition (60,000 words active) - $1,695 * See also: DragonDictate and DragonVoiceTools. Simon Crosby maintains an FAQ for DragonDictate: http://www.cl.cam.ac.uk/users/sac/dd-faq.html * Contact: Dragon Systems, Inc. 320 Nevada Street, Newton, MA 02160, USA Tel: 1-617-965-5200 or 1-800-TALK-TYP Fax: 1-617-527-0372 Email: info@dragonsys.com WWW: http://www.dragonsys.com/ CompuServe: GO DRAGON DragonVoiceTools * Platform: PC * Description: Programmer's toolkit for developing speech-aware DOS or Windows applications. Recognizes continuously spoken digits and discretely spoken words or phrases. Up to 1,000 words can be active at one time. Use words from 110,000 word dictionary (included) and/or develop your own word models. More information on the Dragon Systems WWW pages. * Requirements: Minimum of 20 Mhz 386 (larger vocabulary requires faster processor) with at least 5M memory and at least 19M disk space (depending on vocabulary size), DOS 5.0 and up, Windows 3.1 and up, Borland C or C++ or Microsoft C or C++. DOS applications require IBM M-ACPA card available from IBM or Dragon Systems ($325). 
Windows applications can use industry-standard sound cards (supported are Creative Labs Sound Blaster 16 and Windows Sound System) or M-ACPA card. * Cost: + Developer's kit: US$495 + End-user system: US$195 * See also: DragonDictate and DragonDictate for Windows * Contact: Dragon Systems, Inc. 320 Nevada Street, Newton, MA 02160, USA Tel: 1-617-965-5200 or 1-800-TALK-TYP Fax: 1-617-527-0372 Email: info@dragonsys.com WWW: http://www.dragonsys.com/ CompuServe: GO DRAGON Note: Simon Crosby maintains an FAQ for DragonDictate: http://www.cl.cam.ac.uk/users/sac/dd-faq.html DSP Semiconductor Recognition Chip * Description: Up to 128 word vocabulary; however, the recommended size is 16 words. Requires external memory, a codec and an audio amplifier. Speaker dependent recognition. * Cost: US$18 in quantities * Producer: DSP Semiconductor (no contact details) EARS: Single Word Recognition Package * Platform: UNIX * Description: Intended as a limited ready-to-use single word recognizer. However, its design aims at being a platform for various kinds of methods used in speech recognition (SR). EARS is designed to be a flexible environment for recognition system components; for example, take this feature extractor and that recognizing method, and this list of words. New methods for single word recognition can be integrated easily, as EARS uses C++ abstract base classes. You speak the words you want to be recognized later. Your utterances can be saved to RIFF WAV files so that you can inspect, change or delete them before they are further processed into the pattern files on which the recognizer is finally trained. As of version 0.15, the feature extractors are: Rasta-PLP, PLP, LPC, Mel-Cepstrum. The implemented recognizers are: DTW and non-recurrent neural nets on fixed-size sound patterns. * Misc: The current version is an ALPHA release. 
* Requirements: AF audio server software (see Q1.11) and the OGI Speech Tools (see Q1.9) * Availability: by anonymous ftp ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/recognition/ears-0.15.tar.gz ftp://sunsite.unc.edu/pub/Linux/apps/sound/speech/ears-0.15.tar.gz * Contact: Ralf W. Stephan: ralf@ark.franken.de HM2007 - Speech Recognition Chip * Description: HM2007 is a 48-pin single chip CMOS voice recognition LSI circuit with on-chip analog front end, voice analysis, recognition process and system control functions. A 40 word isolated-word voice recognition system can be composed of an external microphone, keyboard, SRAM and a few other components. When combined with a microprocessor, an intelligent recognition system can be built. A demo board for this chip is being distributed by The Summa Group. * Cost: Approx US$16 for the HM2007 and US$160 for the demo board. * Misc: Jean-Pierre Lereboullet's document on Voice Recognition Processors provides additional information on the HM2007. * Note: Several people have reported problems in obtaining small numbers of this chip (say, fewer than 10). * Producer: HUALON Microelectronic Corp. USA Tel: (415) 288 0390 Fax: (415) 288-0399 * Distributor 1: Marywale Engineering Company Tel: (602) 247 4451 Fax: (602) 247 6167 Email: meco@indirect.com * Distributor 2: The Summa Group Limited One California Street, Suite #1940, San Francisco, CA 94111 Ph: (415) 288-0390 Entropic's HTK (HMM Toolkit) * Platform: Range of Unix platforms. * Description: HTK is a software toolkit for building continuous density HMM based speech recognisers. It consists of a number of library modules and a number of tools. Functions include speech analysis, training tools, recognition tools, results analysis, and an interactive tool for speech labelling. Many standard forms of continuous density HMM are possible. Can perform isolated word or connected word speech recognition. It can model whole words or sub-word units. 
Can perform speaker verification and other pattern recognition work using HMMs. HTK is now integrated with the ESPS/Waves speech research environment which is described in Section 1.9. * Misc 1: The availability of HTK changed in early 1993 when Entropic obtained exclusive marketing rights to HTK from the developers at Cambridge. * Misc 2: More detailed information on HTK is available from the Entropic WWW server: http://www.entropic.com/htk.html * Cost: On request. * Contact: Entropic Research Laboratory, 600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003, USA Phone: (202) 547-1420. email - info@entropic.com WWW: http://www.entropic.com/ IBM VoiceType Dictation * Platform: Intel I486 with IBM OS/2, Windows or Windows95 * Description: Speaker-independent, discrete speech dictation with navigation. Navigation does not require setup; most applications are automatically speech enabled by dynamic control analysis. Dictation averages 70WPM with 95% accuracy and uses statistical trigram modelling. The base system is 22K words. Laptop support through PCMCIA DSP Card. Additional specialised vocabularies available. + US: Legal, Emergency Medicine, Radiology and Journalism + UK: Legal + IT: Radiology * Requirements: 486SX or above, 16MB RAM, 30MB File space, Dictation Adapter * Cost: Software $495 (includes mic) / Hardware $495 * Misc 1: Based on IBM Tangora Technology * Misc 2: Available as Osborne Personal Dictation System in Australia * Availability: US English. Other languages (UK, FR, GR, IT, and ES) available 3Q94. * Contact: US Contact 1-800-TALK-2-ME or 1-914-766-9252. ICSS system from IBM * Description: A large vocabulary, speaker independent, continuous speech system which runs under Windows, OS/2, and AIX. * Requirements: Soundboard (e.g. 
Soundblaster) * Price: $US319 * Contact: A&G Graphics Interface ICSS Reseller 51 Gore Street, Cambridge, MA, 02139, USA (617) 492-0120 IN3 Voice Command * Platform: Sun SPARCstation * Description: IN3 provides a secure, robust, word spotting, continuous speech recognition facility for the Sun OS or Solaris operating systems. The recognition system is a secure operating system facility capable of working with various interfaces, microphones, and devices. The operating system interface works with native UNIX outside of X Windows and also provides enhanced X Windows facilities including named window support. The user interface provides a means to quickly create commands on the fly for replacing long strings and complex operations with voice macros. [Voice macros can reduce the strain of repetitive stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by replacing heavy repetitive keyboard hammering with simple voice operations.] The IN3 user interface works with generic X servers and window managers. A developer API is also available for creating voice-enabled applications, interfacing with other audio sources, and providing extensive application control over the recognition facility. * Availability: SunSite archive at SunSITE.unc.edu as well as on Catalyst CDware as both a runnable demo and unlockable software. * Hardware Required: Sun SPARCstation with audio input. Noise canceling microphone recommended but not required. * Software Required: + Sun OS 4.1.2 with OpenWindows 3.0 + or, Sun OS 4.1.3 + or, Solaris 2.1 or Solaris 2.2 * Misc: An equivalent MS-Windows product is described below. * Price: $495 U.S. * Contact: Brantley Kelly Email: cbk@gacc.atl.ga.us CIS: 75120,431 FAX: 1-404-925-7924 Phone: 1-404-813-8030 Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA IN3 Voice Command for Windows * Platform: PC with Windows 3.1 * Description: IN3 is now available for MS-Windows. Users can call applications to the foreground with voice commands. 
Once the application is called, the user may enter commands and data with voice commands. Voice macros can reduce the strain of repetitive stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by replacing heavy repetitive keyboard hammering with simple voice operations. Voice macros take complex operations and reduce them to simple verbal commands. Voice input can provide new facilities for tasks which could not easily have been otherwise performed without the multiple axis of input. IN3 is hardware-independent: users with any Windows-compatible audio system can add speech recognition to the desktop. IN3 works with either 8 bit or 16 bit Windows audio boards. IN3 is based on continuous word-spotting technology. A developer API is also available for creating voice-enabled applications. * Price: $179 U.S. * Requirements: PC with 80386 processor or better, Microsoft Windows 3.1, and Windows compatible audio system with microphone. * Misc: Fully functional demos are available on Compuserve in various Multimedia and CAD forums. Demos are also available from "America on Line", the comp.binaries.ms-windows archive sites, and various BBS systems. It is also available by anonymous ftp ftp://ftp.wustl.edu/usenet/comp.binaries.ms-windows/v3/in3demo.zip ftp://ftp.uwasa.fi/mirror/ultrasound/demo/in3demo.zip An equivalent Sun product is described above. * Contact: Brantley Kelly Email: cbk@gacc.atl.ga.us CIS: 75120,431 FAX: 1-404-925-7924 Phone: 1-404-925-7950 Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA Kurzweil Voice for Windows * Platform: MS Windows 3.1 * Description: Kurzweil Voice for Windows is a dictation product enabling the user to create text and enter data by speaking to Windows-based applications. System is adaptive but requires no initial training. Users can choose either 30,000 or 60,000 word active vocabulary. Application command translation templates are provided for popular Windows applications such as WordPerfect, 1-2-3, Organizer and Word. 
* Cost: US $995 * Hardware: 486DX/33 or higher, 8 or 16 MB dedicated memory (depending on vocabulary), 30 MB dedicated disk space, VGA or higher, Kurzweil-supplied microphone and DSP board. * Contact: Phone: 1-800-380-1234 Email: info@kurz-ai.com Lernout & Hauspie ASR 1000/T and 1000/M [Note: L&H asr200/A is described below.] * L&H asr1000/T: ASR for the Telephony and Telecommunications Market * L&H asr1000/M: TTS for the Computer and Multimedia Market * Description: Automatic speech recognition software providing continuous speech recognition, isolated word recognition, keyword spotting or continuous digits recognition. The engine is speaker independent, and phoneme-based with optimization for commonly used words. General features include: + Languages available: US English, German, French, Spanish (Castilian), Dutch. + Available vocabulary: >100,000 words. + Line adaptation. + Rejection of out of vocabulary/grammar words. + N-best alternatives for isolated word recognition and keyword spotting. + Push to talk. * asr1000/T + Single channel platform examples: Motorola 56156, TI TMS320C2X/C3X/C5X + Multi-channel platform examples: TI TMS320C3X/C5X, AT&T DSP32C/3210, Motorola 96000 + Input: 8 kHz telephone sampling * asr1000/M + Single processor platform examples: Intel 486/Pentium + Input: 8 kHz telephone or 11 kHz microphone sampling * See also: L&H ASR SDK for Windows * Cost: Unknown * Contact: Lernout & Hauspie Speech Products 800 West Cummings Park, Suite 3100 Woburn, MA 01801, USA Tel: (617) 932 4118 Fax: (617) 932 9209 Email: sales@lhs.com Lernout & Hauspie ASR 200/A for the Automotive and Industrial Market * Description: Automatic speech recognition software providing isolated word recognition, keyword spotting and alphabet recognition (optional). This engine is robust, speaker independent and word based. 
Other features: + Vocabulary: 100 words US English + Voice activation detection + Response time + Platform examples: Analog Devices ADSP2101/5 + Input: 8 kHz telephone or microphone sampling * See also: L&H ASR SDK for Windows * Cost: Unknown * Contact: Lernout & Hauspie Speech Products 800 West Cummings Park, Suite 3100 Woburn, MA 01801, USA Tel: (617) 932 4118 Fax: (617) 932 9209 Email: sales@lhs.com Lernout & Hauspie ASR SDK * Description: Windows based Software Development Kits are available for integrating automatic speech recognition technology with Windows based PC applications. * Requirements: IBM-compatible 486 DX/33 MHz + 8 MB RAM + MS DOS 5.0 + MS Windows 3.1 (or higher) + Sound Blaster compatible sound board. * See also: L&H ASR Products * Cost: Unknown * Contact: Lernout & Hauspie Speech Products 800 West Cummings Park, Suite 3100 Woburn, MA 01801, USA Tel: (617) 932 4118 Fax: (617) 932 9209 Email: sales@lhs.com Listen for Windows 2.0 - Verbex Voice Systems * Platform: Windows * Description: Listen for Windows Version 2.0 is a Speaker Independent software product that provides continuous speech recognition for Windows applications. The product works with most industry standard sound cards and PCs with embedded audio chips. Listen for Windows comes with over 16,000 commands in speech interfaces for over 40 software applications, such as MS Office, Lotus SmartSuite, Quicken, etc. The Listen Command Editor allows a user to change or add commands to existing speech interfaces or create new speech interfaces for most Windows applications. * Requirements: 486/25SX PC or higher * Cost: $99 without a microphone or $139 with either a desktop microphone or headset * Contact: Verbex Voice Systems 1090 King Georges Post Rd., Bldg 107, Edison NJ 08837, USA Tel: 1-800-ASK-VRBX Tel:(908) 225-5225 Fax:(908) 225-7764 Lotec Speech Recognition Package * Platform: Sun * Description: Public domain speech recognition software. 
Operates from input in Sun audio format (.au files) and outputs word hypotheses and time labelling data. The software includes programs to collect speech samples, a labeller, a "featurizer" which parameterises speech files, a word spotter and the recogniser. The software can perform real time recognition on a Sparc 10 for small vocabularies. * Requirements: Sun SPARC audio input and a "decent" microphone, Sun multimedia demo software (in /usr/demo/SOUND) and X. * Availability: By anonymous ftp ftp://ftp.sanpo.t.u-tokyo.ac.jp/pub/nigel/lotec/lotec.tar.Z * Contact: Nigel Ward: nigel@sanpo.t.u-tokyo.ac.jp Myers' Hidden Markov Model software * Description: Hidden Markov model software for automatic speech recognition. C++ code that implements a basic left-right hidden Markov model and corresponding Baum-Welch (ML) training algorithm. It is meant as an example of the HMM algorithms described by L.Rabiner and others. The code was built in order to learn how HMM systems work and we are now offering it to the net so that others can learn how to use HMMs for speech recognition. Keep in mind that ease of understanding was our primary concern, not efficiency. The code can be used to build an experimental speech recognition system using "train_hmm" and "test_hmm", and can be used in conjunction with written tutorials on HMMs to understand how they work. * Availability: By anonymous ftp from the comp.speech archive site. There are two files in the directory + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/recognition/ The files are + hmm.README + hmm-1.03.tar.gz * Contact: Richard Myers: email: rmyers@isx.edu NCC Dictate * Platform: Windows * Description: NCC Digital Dictate(TM) is an add-on, enhanced interface for use with IBM's VoiceType(TM) Dictation for Windows and various Windows 3.1 applications (e.g. MS Word, WordPerfect). Digital Dictate(TM) provides faster corrections and dictation rates and various other features. 
This version is not a stand-alone product; it requires VoiceType(TM) Dictation to provide the speech recognition engine and the Windows application. Features include: + Direct dictation into Windows applications with access to all functions while dictating. + Versions for MS Word, WordPerfect, Ami Pro, and other Windows applications. + Speech enabled editing. + Capability to save speaker models and defer corrections. + Microphone "pause and restore" functions controlled with speech commands. + Add-on vocabularies for legal, medical, science and business. + SWITCH-IT(TM) foot pedal control or CardSwitch(TM) infrared wireless control available which switch between dictation and proofing/correction modes. * Requirements: IBM's VoiceType(TM) Dictation for Windows; a computer system meeting VoiceType(TM) Dictation for Windows requirements; VoiceType(TM) Dictation Adapter. * Availability: Through computer dealerships. * Price: $US295 * Contact: NCC Incorporated 5808 E. Turquoise, Scottsdale, AZ 85253 Ph: (602) 922-6236 Fax: (602) 596-9050 OKI VRP6679 - Voice Recognition Processor * Description: Speech recognition IC. 25 words max. Speaker independent recognition capability. Recognition rate quoted as 97% in a noisy environment (e.g. a car). * Misc: Alias MSM6679 * Misc 2: More information is provided in Jean-Pierre Lereboullet's document on Voice Recognition Processors. * Cost: Approx US$20. Demo board $876 * Availability: OKI Semiconductor and OKI Distributors Corporate Headquarters 785 North Mary Avenue, Sunnyvale, CA, 94086-2909 Tel: (408) 720 1900 Fax: (408) 720 1918 Speech Systems Phonetic Engine 500 (PE500) * Platform: PC * Description: Speaker independent, 40,000 word vocabulary, continuous speech recognition for MS Windows. Grammars with high perplexity possible. Includes noise rejection. Uses proprietary DSP board. * Cost: Prices in US$ - quantity one. The PE500 SDK is $995.00 including board, microphone, and runtime software. Runtime only is $595.00. 
SpeechWizard(r) adds speech input to existing Windows applications, $295.00. Two-day training: $295.00 with purchase, $595.00 without. * Misc: The user defines the grammar of allowed utterances and must write software to invoke the board driver functions that control recognition. The user must also write software to collect/parse/interpret the ASCII text strings returned when recognition succeeds. * Misc 2: SSI now offers speech application development services. * Contact: Speech Systems, Inc. 2945 Center Green Court South Boulder, CO 80301-2275, USA Tel: 303.938.1110 Fax: 303.938.1874 http://www.speechsys.com PowerSecretary * Platform: Centris 650, 660AV; Quadra 650, 660AV, 700, 800, 840AV, 900, 950. * Description: Speaker dependent/adaptive system requiring words to be separated by short pauses. Detailed information is available from their WWW page. * Vocabulary: 30,000 at any one time, automatically selected from 120,000-word dictionary. * Cost: US$2,495; non-AV machines need an audio board, which will cost about US$300. * Requirements: Minimum of 16M of RAM and System 7.0. * Contact: Articulate Systems 600 W. Cummings Park, Suite 4500 Woburn, MA 01801 Ph: (617) 935-5656 Fax: (617) 935-0490 WWW: http://www.artsys.com/ ProNotes Voice Tools (due late '95) * Platform: Windows * Description: ProNotes Voice Tools are designed to bring the speech recognition capabilities of the IBM VoiceType(TM) Dictation System for Windows into any program without the need for the programmer to directly interface with the speech engine at the API level. There are five tools, as described below, which are all available in three forms: Visual Basic(TM) Custom Controls (known as VBXs), 16-bit OLE Custom Controls, and 32-bit OLE Custom Controls. The tools are intended for use by Windows(TM) developers working with Windows 3.1(TM), Windows for Workgroups 3.11(TM), Windows NT 3.51 Workstation(TM), and Windows 95(TM). 
The custom controls can be utilized with any application development environment which supports the use of such controls (e.g. Visual Basic and Visual C++). Playback and Record An object which allows developers to use the IBM Speech Engine to record and play back sound files. Can be used to add voice prompts and to allow end users to record and play back sound files. Voice Button An object having standard button properties and behavior, which can additionally be controlled by voice. The button can also be used as a label or a 3D panel. Dictation Window A text box that allows free dictation, voice macro utilization, and correction by voice. Each Dictation Window has access to global and context sensitive vocabularies for both command and dictation. There are three correction modes. Voice List Box Has standard list box properties and behavior, but can additionally be controlled by voice. A user can select items by pronouncing the entry's text or the entries can be numbered and selected accordingly. Voice Navigator Provides navigation by voice within an application developed with the Voice Tools, between voice-enabled objects described above, as well as some standard objects found within the application. * Availability: ProNotes Voice Tools is due for release before the end of '95. * Contact: Pronotes, Inc. 1546 Magee Avenue, Philadelphia, PA 19149 Ph: (215)-533-8569 Email: proinfo@pronotes.com PureSpeech 2.0 Recognition Engine * Platform: Windows 3.1, Windows 95, Unix, Dialogic Antares DSP * Description: Speaker-independent, continuous speech, large active vocabulary speech recognition engine for American English. Permits on-the-fly additions to the vocabulary using phonetic models and telephone or wideband microphone input. Flexible grammar, natural language processing, discourse models. Software only with a small RAM/CPU footprint. Can be used as a voice user interface (VUI) for PC software applications. 
Can also be used for high-volume call center telephony, especially in banks, finance and other specialized applications. * Availability: PureSpeech is not available as a stand-alone product. It is embedded in other Windows-based software. * Contact: PureSpeech, Inc 100 CambridgePark Drive, Cambridge, MA 02140, USA Ph: (617) 441-0000 Fax: (617) 441-0001 recnet * Platform: UNIX * Description: Speech recognition for the speaker independent TIMIT and Resource Management tasks. It uses recurrent networks to estimate phone probabilities and Markov models to find the most probable sequence of phones or words. The system is a snapshot of evolving research code. There is no documentation other than published research papers. The components are: + A preprocessor which implements many standard and many non-standard front end processing techniques. + A recurrent net recogniser and parameter files + Two Markov model based recognisers, one for phone recognition and one for word recognition + A dynamic programming scoring package. The complete system performs competitively. * Cost: Free * Requirements: TIMIT and Resource Management databases * Contact: Tony Robinson: ajr@eng.cam.ac.uk * Availability: by anonymous ftp ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/recognition/recnet-1.3.tar.Z SayIt * Platform: Sun SPARCstation - SunOS 4.1.x ONLY - SayIt uses NeWS which is no longer available on Solaris 2.x * Description: Voice recognition and macro building package for Suns in the OpenWindows 3.0 environment. Speaker dependent discrete speech recognition. Vocabularies can be associated with applications and the active vocabulary follows the application that has input focus. Macros can include mouse commands, keystrokes, Unix commands, sound, OpenWindows actions and more. An evaluation copy is available by email. * Hardware: Microphone required (SunMicrophone is fine). 
* Cost: $US295 * Contact: Phone: 1-800-245-UNIX or 1-415-572-0200 Fax: 1-415-572-1300 Email: info@qualix.com WWW: http://www.qualix.com/ Simon Says - for NeXT * Platform: NeXT * Description: Provides the ability to link commands to spoken phrases. * Cost: Unknown * Availability: By anonymous ftp Simon Says demo ftp://ftp.informatik.uni-muenchen.de/pub/comp/platforms/next/Audio/audio-apps/SimonSaysDemo.1.5.1.N.b.tar.gz Readme file ftp://ftp.informatik.uni-muenchen.de/pub/comp/platforms/next/Audio/audio-apps/SimonSaysDemo.1.5.1.README * Contact: Metrosoft 710 13th Street, Suite 310 X, San Diego, California 92101 Ph: 619.488.9411 Fax: 619.488.3045 Email: info@metrosoft.com [NeXTmail welcome] Speech Commander - Verbex Voice Systems * Platform: Various - Serial Port connection * Description: A hand-held (portable) device about the size of a paperback book which provides speaker-dependent continuous speech recognition. The device connects through a serial port, so it can be connected to a wide range of computers. It comes with a battery pack. * Misc: Could someone please provide more detailed information on vocab size, training etc? * Contact: Verbex Voice Systems 1090 King Georges Post Rd., Bldg 107, Edison NJ 08837, USA Tel:(908)225-5225 Fax:(908)225-7764 'Speech Recognition Expert' Toolkit for Windows * Description: Provides an object-oriented development tool designed to rapidly build speech enabled applications without writing source code. Currently supports IBM's VoiceType Application Factory. Future versions to support other platforms. Includes BlackBox library and Custom Grammar Tools. * Requirements: Layout for Windows from Objects, Inc. * Price: $US349 + Shipping/Handling * Contact: Speech Technologies, Inc. P.O. 
Box 3905 Naperville, IL 60567-3905 CompuServe @102147,3521 Ph: (708)983-7634 Visual Voice from Stylus Innovation * Platform: Microsoft Windows * Description: Visual Voice is a toolkit for building Windows-based voice processing and telephony applications including interactive voice response (e.g. touch-tone banking), fax-on-demand, and voice mail. Visual Voice can be used to add voice recognition to your telephony applications. Voice Recognition (VR) Support for Visual Voice is exposed as a standard VBX control and provides one or more voice recognition "resources" to your application. Applications can dynamically assign resources across several voice lines. Voice recognition is either "discrete" or "continuous". Discrete recognition is slightly more accurate and requires the speaker to pause briefly between words. Continuous recognition provides a natural way to enter information by speaking without pauses. Three configurations are supported: Software-Only Solution The software only solution uses Telaccount's SpeechEasy technology for discrete recognition using your PC's CPU. A vocabulary is included with digits, basic command words and more. Hardware-Assisted Solution with Dialogic AEB boards Discrete voice recognition in over 25 languages using Dialogic D/41D voice boards and the Dialogic VR/40 board. Vocabularies are included with digits, basic command words, voice mail vocabulary and more. Hardware-Assisted Solution with Dialogic PEB boards Use the VR control with any Dialogic PEB-based voice board, such as the D/12x or D/24x, to access voice recognition resources from your phone lines. This requires a Dialogic VRP board with either 1 to 4 VRM/40 modules (4 channel discrete voice recognition modules) and/or 1 to 4 VRM/2C modules (2 channel continuous voice recognition modules). You can have up to 4 modules on each VRP: 4 VRM/40s for 16 channels of discrete voice recognition; 4 VRM/2Cs for 8 channels of continuous recognition; or a combination. 
Over 25 languages supported. Includes vocabularies as described above. * Pricing: Unknown * Availability: From Stylus Innovations Inc. or from the distributors listed on the Stylus WWW pages. * Misc: More detailed technical information, slide show demonstration software is available on the WWW http://www.stylus.com/stylus/ * Contact: Stylus Innovation Inc. One Kendall Square, Building 300, Cambridge, MA 02139 Ph: (617) 621 9545 Fax: (617) 621 7862 WWW: http://www.stylus.com/stylus/ Compuserve forum: GO STYLUS Email: info@stylus.com Voice Command Line Interface * Platform: Amiga * Description: VCLI will execute CLI commands, ARexx commands, or ARexx scripts by voice command through your audio digitizer. VCLI allows you to launch multiple applications or control any program with an ARexx capability entirely by spoken voice command. VCLI is fully multitasking and will run in the background, continuously listening for your voice commands even while other programs are running. Documentation is provided in AmigaGuide format. VCLI 6.0 runs under either Amiga DOS 2.0 or 3.0. * Cost: Free? * Requirements: Supports the DSS8, PerfectSound 3, Sound Master, Sound Magic, and Generic audio digitizers. * Availability: by ftp from wuarchive.wustl.edu in the file systems/amiga/incoming/audio/VCLI60.lha and from amiga.physik.unizh.ch as the file pub/aminet/util/misc/VCLI60.lha * Contact: Author's email is RHorne@cup.portal.com Voice Control Systems Continuous Speech Recognition * Description: Voice Control Systems (VCS) continuous speech recognition is a proprietary phonetic recognizer based on technology developed at VCS over the last 17 years. It is robust for applications such as the "hands-free" automotive environment or telephone networks, both wireless and wireline. VCS speech recognition is used by many developers and manufacturers in telecommunications. 
VCS technology is a software-based capability which VCS has currently developed for a limited number of processing environments. VCS offers "off-the-shelf" capabilities for the TI-C3X and C4X DSPs, with support for other hardware platforms planned for the future. As a benchmark, today's VCS continuous technology requires about 1/2 of a 33 MHz TMS320C31. VCS continuous technology is available in cellular and wireline based libraries for continuous digit input in approximately 15 languages. VCS continuous recognition is a modified HMM decision strategy built upon the foundation of the VCS phonetic "front end". * Availability: VCS continuous technology is available today in software form from VCS or implemented in hardware or speech systems from VCS distributors including Dialogic Corporation, Brite Voice, Intervoice, Periphonics, and Syntellect. * Cost: Software royalties are volume based and range from per unit costs of $500 per recognizer to less than $5 in large quantities. * See also: the VCS Phonetic Dictionary Recognizer and VCS Isolated Word Speech Recognition below, and the VCS 2030 & 2060 Voice Dialers. * Contact: Voice Control Systems, Inc. 14140 Midway Rd., Dallas, Tx. 75244, USA Ph: +1-214-386-0300 Fax: +1-214-386-5555 Email: sales@vcsi.com Voice Control Systems Phonetic Dictionary Recognizer * Description: This recognizer is based upon an HMM-type recognition strategy coupled with the VCS "front end" (feature extraction software). The HMM modeling is based upon the basic phonetic building blocks in each language. In American English this is approximately 43 units. The recognition vocabulary is built up by combining these units into word models. By building the words in this way new recognition vocabularies may be constructed. The phonetic assembly can also be used for "word spotting" recognition libraries. * Platform: This VCS recognition software runs on the TI TMS320C30 DSP. Two recognizers can operate on a single 55 MHz C30.
Currently the software may be purchased as an Enhanced Technology from VCS to run on the Dialogic VR/160p speech recognizer board. The hardware is purchased from Dialogic, with the "Enhanced" software purchased from VCS. Up to four phonetic recognizers can run on a single 160; one per VRM2C (C30-33 MHz DSP) daughtercard. * Note: This recognizer is in its late "beta" stage of development and is available for U.S. English vocabularies. Other languages are presently under development. * Price: VCS software is priced at $350 per recognizer for unit quantities with volume discounts available. * See also: VCS Continuous Recognition above, VCS Isolated Word Speech Recognition below, and the VCS 2030 & 2060 Voice Dialers. * Contact: Voice Control Systems, Inc. 14140 Midway Rd., Dallas, Tx. 75244, USA Ph: +1-214-386-0300 Fax: +1-214-386-5555 Email: sales@vcsi.com Voice Control Systems Isolated Word Speech Recognition * Description: Voice Control Systems (VCS) isolated word recognition uses VCS phonetic recognizer technology. It is robust in demanding environments such as the "hands-free" automotive environment and telephone networks, whether wireless or wireline. Capabilities include speaker-independent, speaker-dependent and speaker-adaptive recognition. Libraries are available for 45+ languages and custom vocabulary development services are available. The technology is suited for many applications including: + Desktop computing: such as keyboard accelerators or interactive multimedia. + Network telephony: such as automating operator functions or voice dialing. + Computer telephony: such as remote access to a personal computer. + Automotive accessory control: such as voice activated cellular phones or other automotive accessories. + Consumer electronics: such as voice controllers for video games or VCRs and televisions. * Platform: Supported platforms include Intel-X86, TI-C5X, C3X, C4X and C2X, OKI 6679, and NEC-V20 and V30; the software can also operate on 16-bit microcontrollers.
As a benchmark, 8 recognizers can run on an Intel 486-33 DX. * Availability: The technology is available under software licenses direct from VCS or by purchasing hardware from an OEM. VCS OEMs include: Dialogic, Oki Semiconductor, Intervoice, Periphonics, etc. * Cost: VCS isolated word recognition software is available under a volume pricing license agreement. Small quantity royalties are in the $500.00 per recognizer range while large (millions) quantity royalties are less than $1.00 per recognizer. * See also: VCS Continuous Speech Recognition and VCS Phonetic Dictionary Recognizer above, and the VCS 2030 & 2060 Voice Dialers. * Contact: Voice Control Systems, Inc. 14140 Midway Rd., Dallas, Tx. 75244, USA Ph: +1-214-386-0300 Fax: +1-214-386-5555 Email: sales@vcsi.com Visus SpeechKit * Platform: NeXT * Description: SpeechKit is based on SPHINX, a speaker-independent continuous speech recognition system with a vocabulary of roughly 1000 words, and allows you to incorporate speech recognition into your applications. You can design your own vocabulary and grammars. * Contact: Visus - no address or phone provided. A possible contact is Robert Brennan at Carnegie Mellon University. email: Robert_Brennan@cmu.edu VCS 2060 Voice Dialer VCS 2030 Voice Dialer * Platform: Stand-alone unit, TMS320C5X based with VCS phonetic speech recognition and CELP speech compression. * Description: The VCS 2060 is a telephone dialing system which recognizes 50 names and speed dials the associated telephone number. The VCS 2030 has 20 memories. Users use speaker-independent recognition to select the "call", "program", or "list" menu, then place a call, enroll a new memory, or listen to playback of entries in the phonebook. Enrollment is simple and includes a "name tag" enrollment pass so that when one selects an entry to call, the selection is confirmed by repeating the memory's associated name tag, e.g. "calling Pete".
The system uses both speaker-independent and speaker-dependent technology from Voice Control Systems, Inc. * Installation: The VCS 2060 can be installed in series (RJ-11) with one phone for single phone operation or installed in parallel (RJ-31) to provide voice dialing from every phone in a house. * Cost: Standard retail prices: + VCS 2030 Voice Dialer - $269.00 + VCS 2060 Voice Dialer - $299.00 * Availability: From catalogs or direct from Voice Control Systems. Voice Control Systems 14140 Midway Rd., Dallas, Tx. 75225, USA Ph: 800-VCS-7525 Fax: 214-386-5555 Email: sales@vcsi.com Voice-Trek 2.0 * Platform: ? * Description: ? * Contact: Tardis Technology Inc., Voice Recognition Div. 10321 Los Alamitos Blvd., Los Alamitos CA 90720 Tel:(310)799-3355 Fax:(310)799-3360 Creative VoiceAssist * Platform: PC (?) * Price: $US99.95 * Contact: Creative Labs Ph: 1-800-998-5227 Voice Blaster Ver. 4.0 * Platform: IBM AT or higher, DOS or Windows 3.1 * Description: Uses a Sound Blaster or compatible board. Contains a microphone headset and a connector for LPT1:. A printer can still be used on LPT1:. Will recognize 1024 words that are trained by the operator. Each word activates a macro that can enter an ASCII word on the screen or into a word processor or invoke a batch file. An optional footswitch may be installed. Software to run under DOS or Windows 3.1 is included. * Cost: Unknown * Contact: Unknown (original supplier has been sold) VoiceServer for Windows * Platform: PC * Description: Speaker dependent; each user has an independent directory. Isolated word. Up to 1000 words/user, 300 words/window. 1 word occupies 2Kb on hard disk. Can be used to control Windows applications by issuing voice commands instead of menu selection. * Rough Cost: 292 Pounds(UK) * Requirements: None * Misc: Price includes a half-sized AT voice card (including a DSP), software, documentation & a microphone (attachable to keyboard or speaker). A light-weight high-spec headset is an optional extra.
* Contact: Mark Redwood Applied Voice Technologies 26 Danbury Street, Islington, London, UK, N1 8JU Ph: +44 71 454 1224 Fax: +44 71 454 1225 Votan * Platform: MS-DOS, SCO UNIX * Description: Isolated word and continuous speech modes, speaker dependent and (limited) speaker independent. Vocab size is 255 words or up to a fixed memory limit - but it is possible to dynamically load different words for an effectively unlimited number of words. * Cost: Approx US $1,000-$1,500 * Requirements: Cost includes one Votan Voice Recognition ISA-bus board for 386/486-based machines. A software development system is also available for DOS and Unix. * Misc: Up to 8 Votan boards may co-exist for 8 simultaneous voice users. A telephone interface is also available. There is also a 4GL and a software development system. Apparently there is more than one version - can anyone provide more detail? * Contact: 800-877-4756, 510-426-5600 Voice Processing Corporation Speech Recognition Product Line * Description: Voice Processing Corporation (VPC) supplies automated speech recognition systems. VPC's products are used in the telecommunications, cellular and personal computer markets to enable computers to understand human speech. The company's VPro product line is sold to original equipment manufacturers (OEMs), value added resellers (VARs), system integrators and application developers. VPC's speech recognition systems are currently used in applications such as voice mail, voice activated dialing, interactive voice response, and command and control of personal computers. The following are descriptions of Voice Processing Corporation's VPro Product Line: VProContinuous, VPro/XD, VPro/RT, VProCel, VProSpeller, VProPRL, VPro hardware platforms, and the application Osprey. More information is available on these products at the VPC WWW site: http://www.vpro.com/ * VProContinuous(TM) is a speaker-independent, continuous digit recognizer.
It recognizes digit strings spoken in a continuous manner, by any caller, without unnatural beeps or pauses. VProContinuous uses out-of-vocabulary rejection and word spotting technologies to reject extraneous words and phrases often spoken by callers. The VProContinuous vocabulary consists of the words "zero" through "nine," "yes," "no," and "oh." The product is language-independent. American English, Australian English, Brazilian Portuguese, Canadian French, Castilian Spanish, French, German, Italian, Mexican Spanish, Portuguese, Swiss German and U.K. English versions are available. * VPro/XD(TM) is a discrete or multiword speech recognizer for extra-demanding applications and/or vocabularies. This robust discrete product recognizes isolated discrete utterances (words or very short phrases). VPro/XD utilizes proprietary out-of-vocabulary rejection and word-spotting technologies. VPro/XD is speaker-independent and includes Talkover capability allowing speech-interrupt over prompts. Pre-trained vocabulary libraries are available in American English, Australian English, Brazilian Portuguese, Canadian French, Castilian Spanish, Central American Spanish, German, Italian, Mandarin Chinese, Mexican Spanish, Portuguese, Swiss German and UK English. Pre-trained vocabularies consisting of voice mail words, voice dialing words, call control words, banking, and emergency words are available in American English (both cellular and land-line). * VPro/RT(TM) is a discrete speech recognizer for rapid training of vocabularies in the field. This robust discrete product recognizes isolated discrete utterances. Application designers and end-users define the vocabulary of their choice and train the system in real-time either prior to system start-up, or adapting on-the-fly while the system is running live. Vocabularies can be subset, and applications involving thousands of words can be developed quickly. 
VPro/RT, which also supports Talkover, is suited to speaker-dependent recognition tasks, such as the personal directory of names in a voice-activated dialing application. VPro/RT is also good for applications that require speaker-independent vocabularies to be developed quickly in the field or those that require many vocabularies. VPro/RT can also be used as a tool for quick prototyping of applications. * VProCel consists of speaker-independent VProContinuous, VPro/XD and speaker-dependent VPro/RT specifically tuned for the cellular environment. The speaker-dependent discrete feature of VProCel allows for a user-defined 20-word personal directory, with a one-pass enrollment whereby users need only speak their chosen commands once. In addition, cellular-ready VPro/XD vocabularies consisting of voice-activated dialing command words are also available. VProCel is suited to voice-activated dialing applications using either digit strings or a listing of words in a personal directory. * VProSpeller is a recognizer that can determine which name or word is being spelled by a caller. Users may spell a string of letters (up to 32 letters) in an uninterrupted manner (without prompts or beeps between each letter). VProSpeller can recognize confusable letters by conducting an automated search of a database of words maintained by the application for the best candidates to match. * VProPRL is designed for customers who wish to enable VPC speech recognition technologies on platforms other than those supported by VPro hardware. It is a portable recognizer library of VProContinuous, VPro/XD and VPro/RT, which can be embedded into a wide variety of hardware platforms. It consists of a library of object modules which can be linked with a user application or task. * VPro Hardware Platforms: VPro-42, VPro-84, VPro-88: The VPro platforms are ISA compliant PC/AT boards. Each supports four to eight Virtual Speech Processors (VSPs).
Each VSP, depending on load factors, can handle multiple telephone lines. Application and host computers communicate with each of the VSPs as separate autonomous units. VPro platforms use Texas Instruments TMS320C31 microprocessors which provide up to 133 MFLOPS of compute power. The platforms can have up to 8 megabytes of memory shared among all processors. In addition, each processor has 512K bytes of local memory. Both the PEB and MVIP PCM audio buses are supported by all VPro platforms. * Osprey is a call management software application that performs the kinds of telephone related activities typically done by a personal assistant, such as answering the phone, screening callers, routing calls, and taking and delivering messages. It is an automated phone attendant. * Price and availability: Contact Voice Processing Corporation * Contact: Kelli V. Smith Voice Processing Corporation 1 Main Street, Cambridge, MA, 02142 USA Ph: (617)494-0100 Fax: (617)494-4970 e-mail: KSmith@vpro.com WWW: http://www.vpro.com/ ___________________________________________________________________________ Copyright (c) 1995 by Andrew Hunt, all rights reserved. This FAQ may be posted to any USENET newsgroup, on-line service, or BBS as long as it is posted in its entirety and includes this copyright statement. This FAQ may not be distributed for financial gain. This FAQ may not be included in any collections or compilations without express permission from the author. --- Andrew Hunt ATR Interpreting Telecommunications Research Labs Hikari-dai 2-2, Seika-cho, Kyoto, 619-02, Japan Tel: +81-774-95 1390 Fax: +81-774-95 1308 Email: andrew@itl.atr.co.jp .