[HN Gopher] Automatically transcribe an interview, meeting or video
       ___________________________________________________________________
        
       Automatically transcribe an interview, meeting or video
        
       Author : MajidMM
       Score  : 66 points
       Date   : 2021-05-10 08:24 UTC (14 hours ago)
        
 (HTM) web link (voicedocs.com)
 (TXT) w3m dump (voicedocs.com)
        
       | eloeffler wrote:
       | Fun fact: In Germany, most state parliaments and the federal
       | parliament still use hand-written stenography for protocols
       | because it is still the most reliable (catching everything:
       | shouts, noise-expressions from the crowd, etc.) and wasn't
       | replaced by a typing system because to date there is no typing
       | stenography that keeps up with the speed of hand-written
       | stenography (in the German language).
        
         | gumby wrote:
         | Note that recording via stenography is a two-step process.
         | 
         | The first is to record the _sounds_ you hear. Look at a common
         | stenographic  "alphabet" (often called "shorthand alphabet"
         | though that practice is essentially dead) or at the keyboard of
         | a stenographic machine.
         | 
         | Then the stenographer reads the output (either hand or machine
         | generated) and writes a text using a combination of cue (from
         | the paper) and memory.
         | 
          | This is quite different from trying to do straight speech-to-
          | text.
        
         | fxtentacle wrote:
         | I once tried to build a German service for transcribing online
         | meeting calls, similar to what UberConference now offers, by
         | using a cloud API for the STT.
         | 
         | Oh wow was I surprised to see the quality. All of the cloud
         | providers are abysmally bad at transcribing German.
         | 
         | I believe the reason is that in German, you can make up word
         | combinations on the fly and use them as valid nouns. And people
         | do that, if it's convenient or if it enables you to be more
         | precise.
         | 
         | "Dampfschiffahrtsgesellschaft" = Society (Gesellschaft) for
         | Driving (Fahrt) of Boats (Schiff) with Steam (Dampf)
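To make the compounding problem concrete, here is a minimal sketch of a greedy, dictionary-based compound splitter. The tiny vocabulary, the linking-"s" rule and the function name `split_compound` are assumptions for illustration only, not how any cloud STT provider actually works.

```python
from typing import List, Optional

# Toy vocabulary (assumption); a real system would need a large lexicon.
VOCAB = {"dampf", "schiff", "fahrt", "gesellschaft"}
# German often inserts a linking "s" between the parts of a compound.
LINKERS = ("", "s")


def split_compound(word: str, vocab=VOCAB) -> Optional[List[str]]:
    """Return the known parts of a compound, or None if it cannot be split."""
    word = word.lower()
    if not word:
        return []
    # Try the longest known prefix first, then recurse on the remainder.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in vocab:
            continue
        rest = word[end:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = split_compound(rest[len(linker):], vocab)
            if tail is not None:
                return [head] + tail
    return None


print(split_compound("Dampfschifffahrtsgesellschaft"))
```

With this toy lexicon, the triple-f spelling `Dampfschifffahrtsgesellschaft` splits into four parts, while the double-f spelling quoted above returns `None`, which hints at how brittle a fixed word list is when speakers coin or vary compounds freely.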
        
           | creshal wrote:
           | > Society (Gesellschaft)
           | 
           | In this context, Gesellschaft translates to Company.
           | (GmbH=LLC)
           | 
           | The spelling also depends on whether you're talking about the
           | historical Erste Donau-Dampfschiffahrts-Gesellschaft or any
           | generic Dampfschifffahrts-Gesellschaft - note the ff vs. fff
           | in middle; the old company name retains its pre-1996
           | spelling.
           | 
           | Donaudampfschiffahrtsgesellschaft without hyphens was as far
           | as I can tell never officially used by the company, but used
           | informally as part of the name of the
            | Donaudampfschiffahrtsgesellschaftskapitänstango, a 1930s
           | song.
        
           | rvba wrote:
            | After the simplification of the spelling system (
            | https://en.wikipedia.org/wiki/German_orthography_reform_of_1...
            | ), German got a big advantage: you can write and read nearly
            | everything, even if you don't know its meaning.
           | 
            | Due to its much more complicated grammar, German is much more
            | difficult to learn than English, but at least the spelling is
            | easy.
           | 
           | I wonder why more languages never try to simplify their
           | orthographies. Children could spend years learning useful
           | things, instead of wasting time on spelling.
           | 
            | Controversial opinion here: they should have removed ß
            | (scharfes S) completely. It is still used in some relatively
            | rare cases.
        
           | thunderbong wrote:
           | After reading the last line of your comment I put the words
           | back in the German word (in English) and got -
           | 
           | SteamBoatDrivingSociety
           | 
           | Which actually made complete sense even in English!
        
           | spzb wrote:
            | In my experience, they're pretty poor in English too,
            | especially in ad hoc conversation where people don't finish
            | sentences, repeat themselves, say "um" and "err", etc.
        
             | hnbad wrote:
             | A problem most people don't think about when talking about
             | transcription is that people don't talk like books. Not
             | only do you get unfinished sentences and filler words, you
             | also get garbled words, non-standard pronunciation, and so
             | on.
             | 
             | In the case of pronunciation this primarily poses a problem
             | with detecting the intended word, but in other cases
             | "cleaning up" the output may lose contextual information
             | (e.g. what a speaker was going to say before cutting
             | themselves off and using a different word). This is
             | difficult enough for a human to get right, let alone a
             | machine.
        
               | rob74 wrote:
               | Also, when transcription programs (most familiar example:
               | YouTube) fail, they usually fail on the words that a
               | human listener would also have trouble understanding /
               | telling apart. So the transcription is useful if you are
               | deaf or forced to watch the video without sound, but if
               | you're using subtitles because your English is not good
               | enough to understand the speakers without them, their
               | usefulness is pretty limited...
        
             | hobofan wrote:
             | Once you go beyond ~7 words (= what people would utter to
             | their virtual assistant), the quality of all off the shelf
             | tools (both open source and offered services) is laughable.
             | Sentence boundary detection, punctuation, speaker
             | segmentation, and all those features you would need for
             | good transcription are in a really bad state.
        
               | lostinthefield wrote:
               | "Really bad" is an exaggeration, I think. The auto-
               | transcription features in both Google Meet and Zoom are
               | more than acceptable, they're often very useful in
               | catching missed words during a meeting.
               | 
               | They trip up on technical jargon but handle everyday
               | conversations just fine, including speaker detection,
               | punctuation, idioms, etc.
               | 
               | But that's also a slightly different use case, where each
               | speaker is in their own (somewhat) quiet environment and
               | on separate connections (and thus audio tracks).
               | 
               | It's much harder to do all that after the fact, like with
               | a recorded video.
               | 
               | I find Trint.com, which is partially automatic, to be
               | good for that... the AI does a first pass, and a human
               | cleans it up afterward. YouTube has a similar assisted-
               | auto feature for their captions, minus speaker
               | separation.
        
       | fr33k3y wrote:
       | I'm curious about your service. Can you explain what's different
       | from similar services like Happy Scribe, for instance?
       | 
       | As was already said, you should make clear which languages are
       | supported.
       | 
       | And I think you should put prices in USD and/or euros instead of
       | TL (Turkish lira), ideally euros for European visitors and USD
       | for the rest of the world. Free tier aside, if I'm serious about
       | a service I'm less keen to test it before knowing what it costs.
       | At first I saw the price without looking too closely and thought
       | it was pretty expensive, before understanding it was expressed in
       | TL.
        
         | rauf_f wrote:
          | Sorry, there was a bug in the pricing page. It should now show
          | the prices in USD. The difference is our own speech recognition
          | engine, an easy document-like editor and a separate subtitle
          | editor.
        
       | bkovacev wrote:
       | Do you support speaker diarization?
        
       | jcims wrote:
       | I think you're going to have a hard time competing with the major
       | cloud providers on transcription alone. AWS Transcribe, for
       | example, is quite easy to use and supports batch transcription as
       | well as streaming, custom language models, etc.
       | 
       | There's still quite a bit of value-add possible on top of that,
       | however. The ability to edit transcriptions is a great start,
       | especially if you maintain timecodes against the media.
       | Developing or curating domain-specific language models to improve
       | accuracy is also a likely option. There also appears to be a lot
       | of interest in using real time transcription to augment live
       | events with content derived from the conversation.
       | 
       | Good luck!
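One way to keep timecodes attached to the media while a transcript is edited is to store timing per word and only ever change the text. The sketch below assumes a hypothetical `Word` record with start and end times in seconds; it is not the data model of AWS Transcribe or of the service under discussion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the media
    end: float


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT-style timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def correct_word(words: List[Word], index: int, new_text: str) -> None:
    """Fix a misrecognized word in place; its timing is left untouched."""
    words[index].text = new_text


def to_srt_cue(number: int, words: List[Word]) -> str:
    """Render a run of words as one subtitle cue spanning first to last word."""
    span = f"{srt_timestamp(words[0].start)} --> {srt_timestamp(words[-1].end)}"
    return f"{number}\n{span}\n{' '.join(w.text for w in words)}\n"


words = [Word("steam", 1.0, 1.4), Word("bought", 1.4, 1.9)]
correct_word(words, 1, "boat")  # the edit changes text only
print(to_srt_cue(1, words))
```

Because an edit never touches `start` or `end`, the corrected text can still be used to seek in the audio or to export subtitles.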
        
         | rauf_f wrote:
          | Thanks for the note! As you have stated, there's still a lot of
          | work for a researcher/journalist after getting a raw
          | transcription, so a good editing tool that syncs audio and text
          | is valuable here.
        
       | varispeed wrote:
       | I wouldn't feel easy uploading sensitive information for
       | "transcription". Who is this service for? As an interviewee, I
       | also wouldn't consent to a potential employer disclosing my
       | information in such a way.
        
       | Clewza313 wrote:
       | Quite a few large companies intentionally do not use audio
       | transcription services, because they don't want the liability of
       | everything everybody has ever said being written down in a format
       | that can easily be searched during legal discovery.
        
       | offtop5 wrote:
       | Considering the AWS API is essentially open for everyone to start
       | a transcription service, what exactly is the difference here? If
       | you know what you're doing you can build this in about 4 hours.
        
         | frankenst1 wrote:
         | And "Dropbox is just SVN mounted on top of curlftpfs" - doesn't
         | mean there isn't a market to make technological capabilities
         | more easily accessible for the masses.
         | 
         | That being said, I am skeptical about the quality and would
         | like to see some demos. Audio recordings of meetings are
         | especially difficult to transcribe accurately.
        
           | offtop5 wrote:
           | That's a great point, but I'm seeing an absolute explosion of
           | transcription services which are all essentially based on
           | AWS.
           | 
           | The only real innovation here is when this is combined with
           | language learning apps to help me practice my Chinese
           | pronunciation, but even then I know I'll have to look to hire
           | a tutor soon.
        
       | zackees wrote:
       | I just made a python package that does everything you are
       | offering but for free. And yes it does direct links to youtube
       | and twitter.
       | 
       | pip install transcribe-anything
       | transcribe_anything <YT_VID> out.txt
        
       | CharlesW wrote:
       | Anyone know why this is interesting enough to be on Hacker News?
       | There are lots of services which do this, many of which have
       | significantly better functionality. This just looks like a very
       | thin wrapper on a cloud speech-to-text service.
        
         | rauf_f wrote:
         | The company builds its own speech-to-text engine and has better
         | accuracy than Google in German and Turkish languages.
         | Independent review:
         | https://www.abtipper.de/transkription/sprache-zu-text/
        
       | disabled wrote:
       | This is a good tool to use when dealing with health insurance in
       | the United States. You should at minimum keep an Excel
       | spreadsheet of date, whom you talked to, which department they
       | are from, purpose of the call, follow up actions, etc.
       | 
       | But, with the way insurance has been going in the US lately, you
       | better be recording and transcribing that call. Usually, if the
       | call line is recorded (basically all US health insurance
       | companies do this) you can legally record the phone call without
       | permission from the other party.
       | 
       | I personally have an NVIDIA Jetson AGX Xavier with AI tools for
       | speech-to-text, person identification, and transcribing, which I
       | use for important phone calls. I use my own AI tools and devices
       | for privacy reasons.
        
         | rubatuga wrote:
         | Please let us know what models you use for STT!
        
       | laurex wrote:
       | The pricing on this transcription is very high ($12/hr) for
       | automated transcription. Compare to existing solutions like
       | Descript, Rev, Otter.ai - what makes it so much better?
        
       | hnbad wrote:
       | Obvious caveat that automatic transcriptions are not a
       | replacement for manual transcriptions. They're better than
       | nothing but the problem with mistakes in automated transcriptions
       | is that they can entirely change the meaning of a statement in
       | ways that are not necessarily obvious if you don't listen to the
       | audio at the same time.
       | 
       | They also struggle with domain specific jargon depending on what
       | data they were trained on. While manual transcriptions will mark
       | ambiguous utterances as such (or ask for additional information),
       | automation can create a false sense of certainty while just
       | "guessing" whatever it matches most closely. This is a hard
       | problem and unlikely to be solved soon.
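A partial mitigation for that false sense of certainty is to surface the recognizer's own doubt, assuming the engine exposes a per-word confidence score between 0 and 1. The function name, the `[?word?]` marker and the 0.6 threshold below are arbitrary choices for this sketch.

```python
from typing import List, Tuple


def mark_uncertain(words: List[Tuple[str, float]], threshold: float = 0.6) -> str:
    """Join words into text, wrapping those below `threshold` as [?word?]."""
    out = []
    for text, confidence in words:
        # Low-confidence words stay visible but are flagged for human review.
        out.append(f"[?{text}?]" if confidence < threshold else text)
    return " ".join(out)


# Hypothetical recognizer output: (word, confidence) pairs.
result = [("we", 0.98), ("can't", 0.41), ("ship", 0.93)]
print(mark_uncertain(result))
```

This does not make the guess any better, but it mimics what a human transcriber does when marking an utterance as ambiguous, so a reader knows where to check the audio.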
        
         | ghaff wrote:
         | I find they serve different use cases.
         | 
         | ML transcriptions are fast/cheap and they're fine if you mostly
         | want to pull out some quotes or check some things in your
         | notes. But, in general, I find they're not remotely worth my
         | time if I'm going to publish a transcript in which case I get a
         | human transcription. (And even that can be a bit tough with
         | accents, technical jargon, overlapping voices, etc.)
        
           | hnbad wrote:
           | I would agree but given that automatic transcriptions are
           | cheaper, many people treat it as an alternative when manual
           | transcriptions would be more appropriate.
           | 
           | Some tech conferences were pretty good about hiring actual
           | people for live captioning, which was great, but with
           | conferences mostly happening online via video streams at the
           | moment, automated captions and transcriptions might seem like
           | an obvious choice if you don't understand the limitations.
        
       | Johnyma22 wrote:
       | I wonder if you could transcribe really-real time into something
       | like Etherpad? https://etherpad.org
        
       | robsalasco wrote:
       | Is the Spanish language supported?
        
       | mjparrott wrote:
       | A lot of spoken text is highly inefficient to read.
        
       | Aeolun wrote:
       | This is cool, but what exactly do you transcribe? What happens if
       | I upload a Spanish or Japanese video?
        
       | hnbad wrote:
       | I just noticed that even the German-language footer doesn't
       | include a link clearly labelled "Impressum". That information
       | seems to be in the privacy policy (which I can only get in
       | English even when switching to German?) but that is not
       | sufficient to meet German legal requirements.
       | 
       | The privacy policy also doesn't provide all the information the
       | GDPR generally requires you to provide, e.g. spelling out users'
       | rights under the GDPR and what legal basis is given for
       | collecting each specific piece of information.
       | 
       | I'm mostly pointing this out because it could get them sued, but
       | I'd also expect a company based on a service like this to take
       | privacy a bit more seriously, or at least present themselves as
       | if they do so.
        
         | rauf_f wrote:
          | Thanks for the review. The privacy policy lists all collected
          | information, how (if at all) it is shared with other parties,
          | and the users' right to delete the information at any time.
          | What else should be listed here? I didn't get the "legal basis
          | for collecting each information" -- is it required? This is
          | just basic information that the software needs to operate.
        
           | hnbad wrote:
            | Well, first of all, you need a link clearly indicating it's
            | the "Impressum" (usually translated as "imprint", "legal" or
            | similar in English versions) as per §5 TMG:
            | https://de.wikipedia.org/wiki/Impressumspflicht#Telemedienge...
           | 
            | You can get sued for omitting such a page (by any bored
            | lawyer really) because it's considered anti-competitive and a
            | misdemeanor:
            | https://de.wikipedia.org/wiki/Impressumspflicht#Ordnungswidr...
           | 
           | Here's a lengthy explainer of what should go in a privacy
           | policy to be fully compliant (in German), note that "clear
           | and precise" language is generally understood to mean being
           | explicit about the legal basis (i.e. parts of the GDPR) under
           | which the data is collected and processed:
           | https://www.datenschutz.org/datenschutzerklaerung/
           | 
           | In any case, your privacy policy link on the German language
           | version of your website gives me the policy in English, which
           | violates the GDPR's requirements for "clear language"
           | regardless of the actual content by not being in German:
           | https://voicedocs.com/de/legal/privacy-policy
           | 
           | But to be honest, you shouldn't be asking a random person on
           | HN, you should talk to a lawyer.
        
       | andix wrote:
       | No, because of data protection.
       | 
       | I won't upload recordings (with possibly sensitive information)
       | to a third party.
        
         | MajidMM wrote:
         | There is a data protection policy, of course.
        
           | creshal wrote:
           | Which only confirms that it's impossible to use your service
           | and stay in compliance with GDPR.
        
             | hnbad wrote:
             | Can you clarify? They're a German company and state that
             | they do not share uploaded audio recordings with third
             | parties.
             | 
             | You'll need to sign a DPA with them to be compliant with
             | the GDPR tho, and they'd need to disclose where the data
             | will be stored and processed and how they maintain control
             | over that data if it's a third party.
        
               | [deleted]
        
           | HenryBemis wrote:
           | > "Trusted by organizations of all sizes".
           | 
           | Apart from Itep Pictures, all others seem to be in Turkey.
           | Are you based in Turkey?
           | 
            | If yes, allow me to place _zero trust_ in everything-Turkey
            | under the current government/leadership. I strongly believe
            | that Turkey lacks the basic/fundamental freedoms and that the
            | rule of law is going whichever way this regime's leader wants
            | it to go.
           | 
           | I would similarly hesitate to upload such data to Iran, North
           | Korea, Syria.
           | 
           | If no, where are you based?
        
             | rauf_f wrote:
             | No, the company is based in Germany.
        
       | kasperni wrote:
       | Any SaaS service that processes spoken or written language should
       | clearly state what languages it supports on the front page.
        
         | camillomiller wrote:
         | Very very good point
        
       | js8 wrote:
       | Not against the idea per se, but man, I wish people understood
       | that a recording of a conversation is NOT a replacement for good
       | notes or documentation.
       | 
       | Recording a conversation means saving the expert's time in
       | exchange for additional time spent by the student when looking
       | things up in it. Having good notes/docs is easier for students,
       | but more expensive for the expert, who needs to spend more time
       | organizing the information properly.
       | 
       | So depending on what you're doing, there might be different
       | tradeoffs.
        
         | rauf_f wrote:
          | Can recording and then transcribing with a good editor (with
          | automatic speech-to-text built in) be a good solution?
        
           | js8 wrote:
           | My point is, recording (even if we talk about something like
           | a chat log) is just data, but to convert it to notes, i.e.
           | information, you need to do additional (editing) work. That
           | is the hard problem.
           | 
            | Sure, searchable conversation data is better than nothing.
            | But it is, by definition, disorganized. I worry about a
            | future where people will stop making notes/docs just because
            | they can record everything.
        
       | mesaframe wrote:
       | Doesn't Google Recorder do this for you for free?
        
         | rauf_f wrote:
          | Yes, but there's still a lot of work after getting the
          | transcript, even if it is accurate. Researchers/journalists
          | need good editing tools for reviewing, editing, summarizing,
          | etc.
        
       | Erazal wrote:
       | :s/voicedocs/My own shameless plugin/g
       | 
       | We provide the same functionality (except for the Word export)
       | 
       | + direct recording and upload from Zoom, Hangouts, etc.
       | 
       | + video / audio editing & sharing by high-lighting which part of
       | the transcript you'd like to keep.
       | 
       | www.spoke.app :)
       | 
       | In 70 languages (see language list here:
       | https://spoke-for-sumo-lings.webflow.io/)
        
       | jarym wrote:
       | I've used Sonix.ai for this type of thing before. Simple pricing
       | and nice editor. The only thing it could benefit from is speaker
       | identification.
       | 
       | Apart from your service offering an on-prem option, is there much
       | else that's different?
        
       ___________________________________________________________________
       (page generated 2021-05-10 23:02 UTC)