This is written not only from a WAI perspective, but also from the perspective of making video/audio usable.
Before jumping into syntax at the end, I would like to consider the requirements for captions and transcriptions. Here are four of mine, based in part on an earlier post of mine.
1. IT HAS TO BE COLLABORATIVE
Look at the process where transcripts are made. Take the TED.com videos as an example, which I think is more useful than e.g. YouTube.
0. Binary video
Step 1: Captioning
1a. autocaption (Speech Recognition)
1c. contributors turn it into regular language, add value
1e. editors take the contributions and fix them, add quality
Step 2: Caption translation
2c. contributors translate into target language and script, add value
2e. editors take the contributions and fix them, add quality
Step 3: Encoding context
3a. Automatic metadata
3c. contributors correct and tag the video, add value
3e. editors take the contributions and fix them, add quality
Step 2 is dependent on step 1, while step 3 is semi-independent and can be shunted off to some metadata discussion. Some of it is relevant, though. TED talks are talks made by a single speaker, but in interviews or discussions there may be multiple voices.
What I am getting at, however, is that in practice these steps are carried out by different people, often at different times and places. The infrastructure has to allow for that. The people making the video may not be the ones making the captions or transcriptions, who in turn may not be the ones translating them into readable Mandarin Chinese.
2. IT HAS TO HAVE SENSIBLE FALLBACKS
In the absence of collaboration we have to rely on automatic captioning and translation, provided by the web site and/or the user agent. Combining both will likely end up as double Dutch with today's technology, but it is still better than nothing. The transcription of that Urdu video may not be easily understandable, but at least we might get an inkling of what it is about and whether we should spend effort on getting a better translation. A caption made by a human should be vastly better, though in real cases it is not unlikely that a caption is neither made by a human nor better than what a UA can provide.
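As a rough sketch of such a fallback, assume every caption revision records which stage of the process in section 1 produced it. The stage names, the Revision type and the ranking below are my own illustrative assumptions, not anything a site or spec defines, and the ranking simply assumes that edited beats contributed beats automatic, which as noted is not always true in practice.

```python
from typing import List, NamedTuple, Optional

# Hypothetical ranking, mirroring the a/c/e letters of the steps in section 1.
STAGE_RANK = {"auto": 0, "contributor": 1, "editor": 2}


class Revision(NamedTuple):
    lang: str    # language tag, e.g. "en" or "ur"
    stage: str   # one of the keys of STAGE_RANK
    author: str  # who (or what) produced this revision
    text: str


def best_revision(revisions: List[Revision], lang: str) -> Optional[Revision]:
    """Return the most refined revision available for lang, or None."""
    candidates = [r for r in revisions if r.lang == lang]
    if not candidates:
        return None  # nothing at all: the UA would have to caption by itself
    return max(candidates, key=lambda r: STAGE_RANK[r.stage])
```

With only an automatic caption present, best_revision returns that one; once a contributor or editor has added a revision, it takes over.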
This extends to transcriptions versus captions, captions versus subtitles, and the interaction with the description, chapter, and metadata types. Fairly obviously, a subtitle can with some degradation serve as a caption, or a caption as a subtitle. The rule for what would serve as a transcript in the absence of a transcript track is less obvious, but necessary if transcripts are to work.
It seems to me that just as a caption is a subtitle in the absence of audio, a transcription is a caption in the absence of video. Furthermore, it need not be timed, and would not normally be rendered as timed. It would however be a mistake to remove the timing information from transcriptions, even if they are not presented as timed text.
Most "real world" transcripts are edited, removing umms and errs, filler words, non-verbal cues, and pauses, which gives them a more written than oral form. But this is not an inherent characteristic of transcripts. A linguistic transcript could go to considerable lengths: pauses and laughter could be significant, even the length of a pause, as with the Nixon tapes.
Thus subtitles, captions, and transcripts could be used as fallbacks for each other, but it should be specified how.
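To make "it should be specified how" concrete, here is a minimal sketch of one possible rule. The fallback table is purely my assumption (nothing in the spec defines it), and Track and pick_track are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed fallback order: for each wanted kind, the kinds to try in turn.
# This table is a guess at a sensible rule, not something the spec says.
FALLBACKS = {
    "subtitles": ["subtitles", "captions", "transcript"],
    "captions": ["captions", "subtitles", "transcript"],
    "transcript": ["transcript", "captions", "subtitles"],
}


@dataclass
class Track:
    kind: str  # "subtitles", "captions" or "transcript"
    lang: str  # language tag such as "en" or "zh-Hans"


def pick_track(tracks: List[Track], wanted_kind: str, lang: str) -> Optional[Track]:
    """Return the closest available track for wanted_kind in lang, or None."""
    for kind in FALLBACKS[wanted_kind]:
        for track in tracks:
            if track.kind == kind and track.lang == lang:
                return track
    return None  # no human-made track at all; automatic captioning is next
```

A video with only an English caption track would then still yield a usable English transcript, degraded but better than nothing.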
3. IT HAS TO SUPPORT SOME TRANSCRIPTION VIEWPORT
A transcription is meant to be read, not watched. It may or may not be read with the transcribed video active.
A good model for transcripts is TED.com (done with scripting rather than HTML5, but the functionality should be replicated). Take as an example a TED talk like "Building blocks that blink, beep and teach". You can pick subtitles in one out of 23 different languages/transcriptions (at present). If you click the "Interactive transcript" button you can get a transcript in any of these languages. Incidentally, you don't have to pick the same language for subtitle and transcript.
Crucially, the transcript is interactive and timed, even though the time cues are not visible. Change the language to Simplified Chinese, activate the text in "在大约100年后的1947年" and you get to the point 45 seconds into the video where the speaker talks about LEGO. The video and the transcript are independent but linked. This is a very valuable property.
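The "independent but linked" behaviour can be sketched as follows. This is not how TED implements it; Cue and the three functions are hypothetical, and the sketch assumes cues sorted by start time that do not overlap.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Cue(NamedTuple):
    start: float  # seconds from the start of the video
    end: float
    text: str


def as_plain_text(cues: List[Cue]) -> str:
    """Render the transcript for reading; the time cues stay hidden."""
    return " ".join(cue.text for cue in cues)


def seek_time_for(cues: List[Cue], index: int) -> float:
    """Transcript -> video: where to jump when the reader activates a cue."""
    return cues[index].start


def active_cue(cues: List[Cue], now: float) -> Optional[int]:
    """Video -> transcript: index of the cue to highlight at time now."""
    i = bisect_right([cue.start for cue in cues], now) - 1
    if i >= 0 and now < cues[i].end:
        return i
    return None  # playback is in a gap between cues
```

The text is shown untimed, yet because the cues keep their start and end times the link works in both directions, which is why the timing should not be thrown away.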
4. IT HAS TO HAVE MULTI-LANGUAGE SUPPORT
As with TED, future videos should simultaneously support any language the authors and contributors are able to encode.
Here in China it isn't unheard of to have three subtitles shown simultaneously (Simplified Chinese characters, Pinyin, and English), though a double subtitle track (e.g. Simplified Chinese and English) is far more common.
While there is a use case for multiple subtitles, it would be less common for transcriptions. A selection box UI like what TED uses to pick one transcription out of multiple seems more natural.
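A minimal sketch of that selection model, under my own assumptions: several subtitle tracks may be active at once, at most one transcript, and the two are chosen independently. TextTrackSelection and its methods are hypothetical names, not a proposed API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextTrackSelection:
    available: List[str]                                # language tags on offer
    subtitles: List[str] = field(default_factory=list)  # several, in display order
    transcript: Optional[str] = None                    # at most one

    def toggle_subtitle(self, lang: str) -> None:
        """Switch one subtitle language on or off, leaving the others alone."""
        if lang not in self.available:
            raise ValueError("no track for language: " + lang)
        if lang in self.subtitles:
            self.subtitles.remove(lang)
        else:
            self.subtitles.append(lang)

    def set_transcript(self, lang: Optional[str]) -> None:
        """Pick the single transcript language, or None to hide the transcript."""
        if lang is not None and lang not in self.available:
            raise ValueError("no track for language: " + lang)
        self.transcript = lang
```

A viewer could thus watch with Simplified Chinese and English subtitles while reading the transcript in English only.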
SYNTAX? WHAT SYNTAX?
I don't care that much about syntaxes, but would say that the first syntax is both more natural, putting transcripts in line with captions and subtitles, and more flexible, given that popular videos should have a multitude of languages (and thus a multitude of transcripts).
The second syntax seems to have been born out of a concern that while the video may have a transcription it seems trapped inside the video container. It should be possible to display text tracks in another context, either embedded in the same document like TED does with "Interactive transcript" or by linking to it.
Furthermore, it should be possible for the UA to display a transcript regardless of whether the author has provided a TED-like "transcript" container, and regardless of whether the transcript was made by the original author or by a different contributor.
This seems as yet unanswered in the spec. We have the transcript (or a transcript fallback), but how can it be activated? Will it be interactive (like the TED case) by default?