[HN Gopher] Automatically transcribe an interview, meeting or video
___________________________________________________________________
Automatically transcribe an interview, meeting or video
Author : MajidMM
Score : 66 points
Date : 2021-05-10 08:24 UTC (14 hours ago)
(HTM) web link (voicedocs.com)
(TXT) w3m dump (voicedocs.com)
| eloeffler wrote:
| Fun fact: In Germany, most state parliaments and the federal
| parliament still use hand-written stenography for their minutes
| because it is still the most reliable method (it catches
| everything: shouts, noises from the crowd, etc.). It hasn't been
| replaced by a typing system because to date no machine stenography
| keeps up with the speed of hand-written stenography (in the German
| language).
| gumby wrote:
| Note that recording via stenography is a two-step process.
|
| The first is to record the _sounds_ you hear. Look at a common
| stenographic "alphabet" (often called "shorthand alphabet"
| though that practice is essentially dead) or at the keyboard of
| a stenographic machine.
|
| Then the stenographer reads the output (either hand or machine
| generated) and writes a text using a combination of cue (from
| the paper) and memory.
|
| This is quite different from trying to do straight
| speech-to-text.
| fxtentacle wrote:
| I once tried to build a German service for transcribing online
| meeting calls, similar to what UberConference now offers, by
| using a cloud API for the STT.
|
| Oh wow was I surprised to see the quality. All of the cloud
| providers are abysmally bad at transcribing German.
|
| I believe the reason is that in German, you can make up word
| combinations on the fly and use them as valid nouns. And people
| do that, if it's convenient or if it enables you to be more
| precise.
|
| "Dampfschiffahrtsgesellschaft" = Society (Gesellschaft) for
| Driving (Fahrt) of Boats (Schiff) with Steam (Dampf)
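|
| A minimal sketch of why such compounds are hard for a fixed
| vocabulary: the splitter below only recognizes a compound if every
| part is in its word list. The lexicon, the linking-"s" rule and the
| dropped-consonant rule for older spellings are simplifying
| assumptions for illustration, not how any cloud STT service works.

```python
# Toy lexicon: a hypothetical, hand-picked word list for illustration only.
LEXICON = {"dampf", "schiff", "fahrt", "gesellschaft", "donau"}
LINKERS = ("s",)  # linking "s" that may join two parts of a compound


def split_compound(word, lexicon=LEXICON):
    """Return one split of `word` into lexicon entries, or None if none exists."""
    word = word.lower()
    if not word:
        return []
    # Try longer prefixes first.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        candidates = [rest]
        # Allow a linking "s" between parts ("Fahrt-s-gesellschaft").
        candidates += [rest[len(l):] for l in LINKERS if rest.startswith(l)]
        # Older spellings merge a tripled consonant ("Schiffahrt"), so also
        # try restoring the letter if the head ends in a doubled one.
        if len(head) >= 2 and head[-1] == head[-2]:
            candidates.append(head[-1] + rest)
        for tail in candidates:
            parts = split_compound(tail, lexicon)
            if parts is not None:
                return [head] + parts
    return None


print(split_compound("Dampfschiffahrtsgesellschaft"))
print(split_compound("Dampfkatze"))  # "katze" is not in the toy lexicon
```

| Any part missing from the word list makes the whole compound
| unrecognizable, which is the failure mode described above.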
| creshal wrote:
| > Society (Gesellschaft)
|
| In this context, Gesellschaft translates to Company.
| (GmbH=LLC)
|
| The spelling also depends on whether you're talking about the
| historical Erste Donau-Dampfschiffahrts-Gesellschaft or any
| generic Dampfschifffahrts-Gesellschaft - note the ff vs. fff
| in middle; the old company name retains its pre-1996
| spelling.
|
| Donaudampfschiffahrtsgesellschaft without hyphens was as far
| as I can tell never officially used by the company, but used
| informally as part of the name of the
| Donaudampfschiffahrtsgesellschaftskapitänstango, a 1930s
| song.
| rvba wrote:
| After the simplification of the spelling system (
| https://en.wikipedia.org/wiki/German_orthography_reform_of_1... ),
| German got a big advantage: you can write and read nearly
| everything, even if you don't know its meaning.
|
| Due to much more complicated grammar German is much more
| difficult to learn than English, but at least the spelling is
| easy.
|
| I wonder why more languages never try to simplify their
| orthographies. Children could spend years learning useful
| things, instead of wasting time on spelling.
|
| Controversial opinion here: they should have removed ß
| (scharfes S) completely. It is still used in some relatively
| rare cases.
| thunderbong wrote:
| After reading the last line of your comment I put the words
| back in the German word (in English) and got -
|
| SteamBoatDrivingSociety
|
| Which actually made complete sense even in English!
| spzb wrote:
| In my experience, they're pretty poor in English too.
| Especially when it's ad hoc conversation where people don't
| finish sentences, repeat themselves, "um" and "err", etc.
| hnbad wrote:
| A problem most people don't think about when talking about
| transcription is that people don't talk like books. Not
| only do you get unfinished sentences and filler words, you
| also get garbled words, non-standard pronunciation, and so
| on.
|
| In the case of pronunciation this primarily poses a problem
| with detecting the intended word, but in other cases
| "cleaning up" the output may lose contextual information
| (e.g. what a speaker was going to say before cutting
| themselves off and using a different word). This is
| difficult enough for a human to get right, let alone a
| machine.
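|
| A small sketch of the trade-off, assuming a hypothetical filler
| list and a naive whitespace tokenizer: the cleaned text reads
| better, so the raw utterance is kept alongside it rather than
| overwritten.

```python
# Hypothetical filler list; real speech needs far more than this.
FILLERS = {"um", "uh", "er", "err"}


def clean_utterance(raw):
    """Drop filler words but keep the raw text so no context is lost."""
    kept = [
        token
        for token in raw.split()
        if token.lower().strip(",.") not in FILLERS
    ]
    return {"raw": raw, "clean": " ".join(kept)}


result = clean_utterance("Um, I think we, err, should ship it")
print(result["clean"])
print(result["raw"])
```

| Note that the cleaned version still carries a stray comma; tidying
| punctuation is a separate step, and each such step is another place
| where meaning can quietly change.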
| rob74 wrote:
| Also, when transcription programs (most familiar example:
| YouTube) fail, they usually fail on the words that a
| human listener would also have trouble understanding /
| telling apart. So the transcription is useful if you are
| deaf or forced to watch the video without sound, but if
| you're using subtitles because your English is not good
| enough to understand the speakers without them, their
| usefulness is pretty limited...
| hobofan wrote:
| Once you go beyond ~7 words (= what people would utter to
| their virtual assistant), the quality of all off-the-shelf
| tools (both open source and offered services) is laughable.
| Sentence boundary detection, punctuation, speaker
| segmentation, and all those features you would need for
| good transcription are in a really bad state.
| lostinthefield wrote:
| "Really bad" is an exaggeration, I think. The auto-
| transcription features in both Google Meet and Zoom are
| more than acceptable, they're often very useful in
| catching missed words during a meeting.
|
| They trip up on technical jargon but handle everyday
| conversations just fine, including speaker detection,
| punctuation, idioms, etc.
|
| But that's also a slightly different use case, where each
| speaker is in their own (somewhat) quiet environment and
| on separate connections (and thus audio tracks).
|
| It's much harder to do all that after the fact, like with
| a recorded video.
|
| I find Trint.com, which is partially automatic, to be
| good for that... the AI does a first pass, and a human
| cleans it up afterward. YouTube has a similar assisted-
| auto feature for their captions, minus speaker
| separation.
| fr33k3y wrote:
| I'm curious about your service. Can you explain what's different
| from similar services like Happy Scribe, for instance?
|
| As was already said, you should make clear which languages are
| supported.
|
| And I think you should put prices in USD and/or euros instead of
| TL (Turkish lira), ideally euros for European visitors and USD
| for the rest of the world. Free tier aside, if I'm serious about
| the service I will be less keen to test it without knowing what
| it costs. At first glance I saw the price and thought it was
| pretty expensive, before realizing it was expressed in TL.
| rauf_f wrote:
| Sorry, there was a bug on the pricing page. It should now show the
| prices in USD. The difference is our own speech recognition
| engine, an easy document-like editor and a separate subtitle
| editor.
| bkovacev wrote:
| Do you support speaker diarization?
| jcims wrote:
| I think you're going to have a hard time competing with the major
| cloud providers on transcription alone. AWS Transcribe, for
| example, is quite easy to use and supports batch transcription as
| well as streaming, custom language models, etc.
|
| There's still quite a bit of value-add possible on top of that,
| however. The ability to edit transcriptions is a great start,
| especially if you maintain timecodes against the media.
| Developing or curating domain-specific language models to improve
| accuracy is also a likely option. There also appears to be a lot
| of interest in using real time transcription to augment live
| events with content derived from the conversation.
|
| Good luck!
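|
| One way to picture "maintaining timecodes against the media" is
| sketched below, under assumptions: the `Segment` type and helper
| names are hypothetical, and SRT is used only as a familiar export
| format. Editing replaces the text of a segment and leaves its start
| and end times untouched.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the start of the media
    end: float
    text: str


def edit_text(segments, index, new_text):
    """Return a new list with one segment's text replaced, timecodes kept."""
    edited = list(segments)
    edited[index] = replace(edited[index], text=new_text)
    return edited


def format_ts(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render segments as SRT subtitle blocks separated by blank lines."""
    blocks = [
        f"{i}\n{format_ts(s.start)} --> {format_ts(s.end)}\n{s.text}\n"
        for i, s in enumerate(segments, start=1)
    ]
    return "\n".join(blocks)


raw = [Segment(0.0, 2.5, "helo world"), Segment(2.5, 4.0, "second line")]
fixed = edit_text(raw, 0, "Hello, world.")
print(to_srt(fixed))
```

| Because the edit never touches `start` or `end`, the corrected
| text can still be lined up with the audio or exported as subtitles.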
| rauf_f wrote:
| Thanks for the note! As you said, there's still a lot of work
| for a researcher or journalist after getting the raw
| transcription, so a good editing tool that syncs audio and text
| is valuable here.
| varispeed wrote:
| I wouldn't feel comfortable uploading sensitive information for
| "transcription". Who is this service for? As an interviewee I
| also wouldn't consent to a potential employer disclosing my
| information in such a way.
| Clewza313 wrote:
| Quite a few large companies intentionally do not use audio
| transcription services, because they don't want the liability of
| everything everybody has ever said being written down in a format
| that can easily be searched during legal discovery.
| offtop5 wrote:
| Considering the AWS API is essentially open for everyone to start
| a transcription service, what exactly is the difference here? If
| you know what you're doing you can build this in about 4 hours.
| frankenst1 wrote:
| And "Dropbox is just SVN mounted on top of curlftpfs" - doesn't
| mean there isn't a market to make technological capabilities
| more easily accessible for the masses.
|
| That being said, I am skeptical about the quality and would
| like to see some demos. Audio recordings of meetings are
| especially difficult to transcribe accurately.
| offtop5 wrote:
| That's a great point, but I'm seeing an absolute explosion of
| transcription services which are all essentially based on
| AWS.
|
| The only real innovation here is when this is combined with
| language learning apps to help me practice my Chinese
| pronunciation, but even then I know I'll have to look to hire
| a tutor soon.
| zackees wrote:
| I just made a Python package that does everything you are
| offering but for free. And yes, it handles direct links to YouTube
| and Twitter.
|
| pip install transcribe-anything
| transcribe_anything <YT_VID> out.txt
| CharlesW wrote:
| Anyone know why this is interesting enough to be on Hacker News?
| There are lots of services which do this, many of which have
| significantly better functionality. This just looks like a very
| thin wrapper on a cloud speech-to-text service.
| rauf_f wrote:
| The company builds its own speech-to-text engine and has better
| accuracy than Google for German and Turkish.
| Independent review:
| https://www.abtipper.de/transkription/sprache-zu-text/
| disabled wrote:
| This is a good tool to use when dealing with health insurance in
| the United States. You should at minimum keep an Excel
| spreadsheet of date, whom you talked to, which department they
| are from, purpose of the call, follow up actions, etc.
|
| But, with the way insurance has been going in the US lately, you
| better be recording and transcribing that call. Usually, if the
| call line is recorded (basically all US health insurance
| companies do this) you can legally record the phone call without
| permission from the other party.
|
| I personally have an NVIDIA Jetson AGX Xavier with AI tools for
| speech-to-text, person identification, and transcribing, which I
| use for important phone calls. I use my own AI tools and devices
| for privacy reasons.
| rubatuga wrote:
| Please let us know what models you use for STT!
| laurex wrote:
| The pricing on this transcription is very high ($12/hr) for
| automated transcription. Compare to existing solutions like
| Descript, Rev, Otter.ai - what makes it so much better?
| hnbad wrote:
| Obvious caveat that automatic transcriptions are not a
| replacement for manual transcriptions. They're better than
| nothing but the problem with mistakes in automated transcriptions
| is that they can entirely change the meaning of a statement in
| ways that are not necessarily obvious if you don't listen to the
| audio at the same time.
|
| They also struggle with domain specific jargon depending on what
| data they were trained on. While manual transcriptions will mark
| ambiguous utterances as such (or ask for additional information),
| automation can create a false sense of certainty while just
| "guessing" whatever it matches most closely. This is a hard
| problem and unlikely to be solved soon.
| ghaff wrote:
| I find they serve different use cases.
|
| ML transcriptions are fast/cheap and they're fine if you mostly
| want to pull out some quotes or check some things in your
| notes. But, in general, I find they're not remotely worth my
| time if I'm going to publish a transcript in which case I get a
| human transcription. (And even that can be a bit tough with
| accents, technical jargon, overlapping voices, etc.)
| hnbad wrote:
| I would agree but given that automatic transcriptions are
| cheaper, many people treat it as an alternative when manual
| transcriptions would be more appropriate.
|
| Some tech conferences were pretty good about hiring actual
| people for live captioning, which was great, but with
| conferences mostly happening online via video streams at the
| moment, automated captions and transcriptions might seem like
| an obvious choice if you don't understand the limitations.
| Johnyma22 wrote:
| I wonder if you could transcribe really-real time into something
| like Etherpad? https://etherpad.org
| robsalasco wrote:
| is the spanish language supported?
| mjparrott wrote:
| A lot of spoken text is highly inefficient to read.
| Aeolun wrote:
| This is cool, but what exactly do you transcribe? What happens if
| I upload a Spanish or Japanese video?
| hnbad wrote:
| I just noticed that even the German-language footer doesn't
| include a link clearly labelled "Impressum". That information
| seems to be in the privacy policy (which I can only get in
| English even when switching to German?) but that is not
| sufficient to meet German legal requirements.
|
| The privacy policy also doesn't provide all the information the
| GDPR generally requires you to provide, e.g. spelling out users'
| rights under the GDPR and what legal basis is given for
| collecting each specific piece of information.
|
| I'm mostly pointing this out because it could get them sued, but
| I'd also expect a company based on a service like this to take
| privacy a bit more seriously, or at least present themselves as
| if they do so.
| rauf_f wrote:
| Thanks for the review. The privacy policy lists all collected
| information, how (if at all) it is shared with other parties,
| and the users' right to delete the information at any time.
| What else should be listed here? I didn't get the "legal basis
| for collecting each information" -- is it required? This is
| just basic information that the software needs to operate.
| hnbad wrote:
| Well, first of all, you need a link clearly indicating it's
| the "Impressum" (usually translated as "imprint", "legal" or
| similar in English versions) as per §5 TMG:
| https://de.wikipedia.org/wiki/Impressumspflicht#Telemedienge...
|
| You can get sued for omitting such a page (by any bored
| lawyer really) because it's considered anti-competitive and a
| misdemeanor:
| https://de.wikipedia.org/wiki/Impressumspflicht#Ordnungswidr...
|
| Here's a lengthy explainer of what should go in a privacy
| policy to be fully compliant (in German), note that "clear
| and precise" language is generally understood to mean being
| explicit about the legal basis (i.e. parts of the GDPR) under
| which the data is collected and processed:
| https://www.datenschutz.org/datenschutzerklaerung/
|
| In any case, your privacy policy link on the German language
| version of your website gives me the policy in English, which
| violates the GDPR's requirements for "clear language"
| regardless of the actual content by not being in German:
| https://voicedocs.com/de/legal/privacy-policy
|
| But to be honest, you shouldn't be asking a random person on
| HN, you should talk to a lawyer.
| andix wrote:
| No, because of data protection.
|
| I won't upload recordings (with possibly sensitive information)
| to a third party.
| MajidMM wrote:
| There is a data protection policy, of course.
| creshal wrote:
| Which only confirms that it's impossible to use your service
| and stay in compliance with GDPR.
| hnbad wrote:
| Can you clarify? They're a German company and state that
| they do not share uploaded audio recordings with third
| parties.
|
| You'll need to sign a DPA with them to be compliant with
| the GDPR tho, and they'd need to disclose where the data
| will be stored and processed and how they maintain control
| over that data if it's a third party.
| [deleted]
| HenryBemis wrote:
| > "Trusted by organizations of all sizes".
|
| Apart from Itep Pictures, all others seem to be in Turkey.
| Are you based in Turkey?
|
| If yes, allow me to place _zero trust_ in everything-Turkey,
| under the current government/leadership. I strongly believe
| that Turkey lacks basic/fundamental freedoms and that the rule
| of law goes whichever way this regime's leader wants it to
| go.
|
| I would similarly hesitate to upload such data to Iran, North
| Korea, Syria.
|
| If no, where are you based?
| rauf_f wrote:
| No, the company is based in Germany.
| kasperni wrote:
| Any SaaS service that processes spoken or written language should
| clearly state what languages they support on the front page.
| camillomiller wrote:
| Very very good point
| js8 wrote:
| Not against the idea per se, but man, I wish people understood
| that a recording of a conversation is NOT a replacement for good
| notes or documentation.
|
| Recording a conversation saves the expert's time in exchange for
| additional time spent by the student when looking things up
| in it. Having good notes/docs is easier for students, but more
| expensive for the expert, who needs to spend more time
| organizing the information properly.
|
| So depending on what you're doing, there might be different
| tradeoffs.
| rauf_f wrote:
| Can recording and then transcribing with a good editor (with
| automatic speech-to-text built in) be a good solution?
| js8 wrote:
| My point is, recording (even if we talk about something like
| a chat log) is just data, but to convert it to notes, i.e.
| information, you need to do additional (editing) work. That
| is the hard problem.
|
| Sure, searchable conversation data is better than nothing.
| But it is, by definition, disorganized. I worry about the
| future where people will stop making notes/docs just because
| they can record everything.
| mesaframe wrote:
| Doesn't Google Recorder do this for free?
| rauf_f wrote:
| Yes, but there's still a lot of work after getting the
| transcript, even if it is accurate. Researchers/journalists
| need good editing tools for reviewing, editing, summarizing,
| etc.
| Erazal wrote:
| :s/voicedocs/My own shameless plugin/g
|
| We provide the same functionality (except for the Word export)
|
| + direct recording and upload from Zoom, Hangouts, etc.
|
| + video / audio editing & sharing by high-lighting which part of
| the transcript you'd like to keep.
|
| www.spoke.app :)
|
| In 70 languages (see language list here:
| https://spoke-for-sumo-lings.webflow.io/)
| jarym wrote:
| I've used Sonix.ai for this type of thing before. Simple pricing
| and nice editor. The only thing it could benefit from is speaker
| identification.
|
| Apart from your service offering an on-prem option, is there much
| else that's different?
___________________________________________________________________
(page generated 2021-05-10 23:02 UTC)