Speech Services Control (speechsc) Charter

2.8.20 Speech Services Control (speechsc)

Last Modified: 2003-07-21

Chair(s):

David Oran <oran@cisco.com>
Eric Burger <eburger@snowshore.com>

Transport Area Director(s):

Allison Mankin <mankin@psg.com>
Jon Peterson <jon.peterson@neustar.biz>

Transport Area Advisor:

Jon Peterson <jon.peterson@neustar.biz>

Mailing Lists:

General Discussion: speechsc@ietf.org
To Subscribe: speechsc-request@ietf.org
In Body: subscribe
Archive: www.ietf.org/mail-archive/working-groups/speechsc/current/maillist.html

Description of Working Group:

Many multimedia applications can benefit from having Automated Speech
Recognition (ASR), Text to Speech (TTS), and Speaker Verification (SV)
processing available as a distributed, network resource. To date, there
are a number of proprietary ASR, TTS, and SV APIs, as well as two IETF
drafts, that address this problem. However, there are serious
deficiencies in the existing drafts relating to this problem. In
particular, they mix the semantics of existing protocols yet are close
enough to other protocols as to be confusing to the implementer.

The speechsc Working Group will develop protocols to support distributed
media processing of audio streams. The focus of this working group is
to develop protocols to support ASR, TTS, and SV. The working group
will only focus on the secure distributed control of these servers.

The working group will develop an informational RFC detailing the
architecture and requirements for distributed speechsc control. In
addition, the requirements document will describe the use cases driving
these requirements. The working group will then examine existing
media-related protocols, especially RTSP, for suitability as a protocol
for carriage of speechsc server control. The working group will then
propose extensions to existing protocols or the development of new
protocols, as appropriate, to meet the requirements specified in the
informational RFC.

The protocol will assume RTP carriage of media. Assuming
session-oriented media transport, the protocol will use SDP to describe
the session.

The working group will not be investigating distributed speech
recognition (DSR), as exemplified by the ETSI Aurora project. The
working group will not be recreating functionality available in other
protocols, such as SIP or SDP. The working group will offer changes to
existing protocols, with the possible exception of RTSP, to the
appropriate IETF working group for consideration. This working group will
explore modifications to RTSP, if required.

It is expected that we will coordinate our work in the IETF with the
W3C Multimodal Interaction Working Group; the ITU-T Study Group 16
Working Party 3/16 on SG 16 Question 15/16; the 3GPP TSG SA WG1; and the
ETSI Aurora STQ.

Once the current set of milestones is completed, the speechsc charter
may be expanded, with IESG approval, to cover additional uses of the
technology, such as the orchestration of multiple ASR/TTS/SV servers,
the accommodation of additional types of servers such as simultaneous
translation servers, etc.

Goals and Milestones:

Done		Requirements ID submitted to IESG for publication (informational)
Done		Submit Internet Draft(s) Analyzing Existing Protocols (informational)
Done		Submit Internet Draft Describing New Protocol (if required) (standards track)
Oct 03		Submit Drafts to IESG for publication

Internet-Drafts:

- draft-ietf-speechsc-reqts-04.txt

- draft-ietf-speechsc-protocol-eval-02.txt

No Request For Comments

Current Meeting Report


speechsc working group minutes, Wed July 16
reported by Edwin Aoki <aoki@aol.net>


Eric Burger and David Oran chair


Administrivia and Agenda Bashing
--------------------------------


Proposed Agenda:
 Agenda Bashing           5 min
 Requirement Status       4 min
 Protocol Proposal       90 min
 Protocol Analysis       20 min
 Wrap up and next steps


There were no objections to the agenda as proposed.



Requirements Status - Dave Oran
-------------------------------


The requirements document was in the IESG for some time, and the
majority of comments were integrated into
draft-ietf-speechsc-reqts-04. The security ADs asked for a couple of minor
changes, which will be included in an -05 draft. Among them is a reference
to the risks of using biometrics, including speaker identification and
speaker verification.


After those changes, the draft will go to the RFC editor.


Guido from the RNID had requested some changes in the wording of section 
3.9.  Dave indicated that he'd thought that those changes were already 
incorporated in the -04 draft; Guido thought his comments were for -04. 
Guido will verify that his comments are still appropriate for the -04 
draft.


speechsc Protocol Proposal - Sarvi Shanmugham (via audio link)
--------------------------------------------------------------


http://www.ietf.org/internet-drafts/draft-shanmugham-speechsc-00.txt


The protocol proposal is now in draft form, based on the MRCP
proposal, also now in draft form.  However, there were some issues that came
up relating to MRCP's tunneling capability.  The draft proposes a
SIP-based framework as a control channel to initiate sessions between
client and server.  The control channel will run over TCP or SCTP and will
not use an unreliable protocol such as UDP.


This proposal doesn't address speaker identification or speaker 
verification.


Advantages:
* The speechsc exchange is simple because it need not work around an
unreliable transport protocol
* Allows for TCP/SCTP connection sharing, unlike RTSP, which requires the
client to open a separate connection to the server for each session
* Leverages MRCP - the state machine and flow are the same as MRCP, and are
therefore well-understood


Issues:


Most of the issues that have been raised on the list have been noted and
simply need to be incorporated into the next set of drafts.  Sarvi
presented a slide which listed the known issues, and the remainder of the
discussion focused on these issues (and others that came up in the
course of the discussion).  The chairs took a quick show of hands, which
revealed that a few people have read the most recent draft.


* Issue 1: Define SI and SV


The author has received some responses from a few people who might be 
interested in working on the SI and SV problem, but if there are 
additional people who are interested, they should contact the WG chairs.


Dan Burnett has volunteered.


* Issue 2: Why use SIP (Bryan Wild and others)


There was some discussion around the choice of SIP.  Morna Hirsch asked the 
question (which Bryan Wild and others have asked on the list) why we 
wouldn't continue with the use of RTSP and extend that instead of going all 
the way to SIP?


Sarvi explained that while RTSP was being used, speechsc was primarily
using MRCP as a TCP pipe, and so it worked.  The desire was to move the
messages to the top layer without requiring tunneling, and the separation
of the control channel provided a clean way to do this.  Additionally,
going to SIP allowed for reuse of the TCP pipe between client and server.


In getting some more detail around the use of SIP for speechsc, Colin 
asked whether the proposal was a subset of SIP, or whether there would be 
parts of SIP that people would expect to work, that wouldn't when used in a 
speechsc context.


Sarvi explained that everything one would expect for a standard RFC 
3261-compliant UA would work; it is not a subset of SIP and there's no 
expectation that a profile would be needed.


The chairs took a hum on the question: "Is there consensus on using SIP as 
the session initiation protocol for speechsc?"  The hum indicated rough 
consensus for the statement; there was no opposition.


The chairs then took a hum on the question of whether it would be 
appropriate to adopt this draft as a WG item.  Again, there was no 
opposition, but only a light hum in favor.  The chairs will take this 
question to the list.


* Issue 3: Multiple resources of a given type


Dan Burnett asked for clarification of section 3.2 on adding and
removing resources.  Is it possible to have, for example, multiple ASR
resources and then to be able to drop just one?  As long as there are only
references to resource type and not to specific resources, it's unclear
what would be dropped.


There was some discussion around why one would want to have multiple 
resources - for example to have multiple recognizers in parallel, but the 
current draft does not consider having multiple resources on a single 
session.


Further discussion was taken to the list.
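
As an illustration of the question raised above (not something taken from
the draft), the sketch below gives each resource in a session its own
identifier, so that one of two ASR resources can be dropped unambiguously.
The Session class, its method names, and the "asr-1" identifier format are
all hypothetical.

```python
import itertools


class Session:
    """Hypothetical session that tracks resources by identifier, not by type."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.resources = {}  # resource identifier -> resource type

    def add(self, resource_type):
        # Assumed identifier format: lowercased type plus a running number.
        resource_id = f"{resource_type.lower()}-{next(self._counter)}"
        self.resources[resource_id] = resource_type
        return resource_id

    def of_type(self, resource_type):
        # A type alone may match several resources, which is the ambiguity
        # noted in the discussion.
        return [rid for rid, rtype in self.resources.items()
                if rtype == resource_type]

    def remove(self, resource_id):
        # Removing by identifier names exactly one resource.
        del self.resources[resource_id]


session = Session()
first = session.add("ASR")
second = session.add("ASR")
session.add("TTS")
session.remove(first)  # the second recognizer and the TTS resource remain
```

With only a type to refer to, a request to remove "ASR" here would match
two resources before the removal, and the protocol would have to define
which one is meant.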


* Issue 4: Resource Tokens as strings


The protocol currently defines resources by an integer number. In an XML 
format, it costs the same (in bytes) to use strings such as "SI", "SV", or 
even "ASR" or others.  Colin and Eric independently asked the question of 
the extensibility of the namespace and whether strings could be used 
instead of numbers.


Sarvi indicated that he was open to using strings, perhaps even URIs of the 
form channel ID@asr.


There was some followup discussion on whether these strings would be
arbitrary, negotiated strings, or fixed strings as in an IANA
registry.  The discussion leaned towards fixed strings for each
resource type.


The chairs asked for a concrete proposal to be sent to the list
(Sarvi?)


* Issue 5: Use of the m= line


Neil Deason brought up the issue of how one would specify the choice of TCP 
or SCTP given the current specs.  Two options were proposed.


Proposal 1: One m= line, with a protocol ID of "speechsc" and where the 
MIME type is a resource ID


Proposal 2: One m= line with the protocol ID being the actual protocol used 
(TCP or SCTP), MIME type of "application/speechsc" and additional 
attributes a=resource ID <type>, a=channel ID <identifier>


There were no comments on this and further discussion was taken to the 
list.
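

To make the two proposals easier to compare, here is a rough sketch that
renders each as SDP text.  It assumes the standard SDP layout
"m=<media> <port> <proto> <fmt>" and "a=<attribute>:<value>"; the media
type "application", the port number, and the attribute names "resource"
and "channel" are illustrative guesses, not taken from the draft.

```python
def m_line(media, port, proto, fmt):
    """Format an SDP media description line: m=<media> <port> <proto> <fmt>."""
    return f"m={media} {port} {proto} {fmt}"


def proposal_1(port, resource_id):
    # Proposal 1: the protocol ID is "speechsc" and the format field
    # carries the resource ID.  The transport is not visible here.
    return [m_line("application", port, "speechsc", resource_id)]


def proposal_2(port, transport, resource_type, channel_id):
    # Proposal 2: the protocol ID names the actual transport, the format is
    # application/speechsc, and resource and channel move into attributes.
    if transport not in ("TCP", "SCTP"):
        raise ValueError("transport must be TCP or SCTP")
    return [
        m_line("application", port, transport, "application/speechsc"),
        f"a=resource:{resource_type}",  # hypothetical attribute name
        f"a=channel:{channel_id}",      # hypothetical attribute name
    ]


print("\n".join(proposal_1(9000, "asr")))
print("\n".join(proposal_2(9000, "SCTP", "asr", "ch1")))
```

Only the second form states the choice of TCP or SCTP in the m= line
itself, which is the point Neil raised.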


* Comment


Adam Roach made the comment that having content-length headers in the
middle of the data has proven difficult to implement efficiently in other
WGs (like SIP).  Subsequent work, for example in MSRP, has gone to more of a
fixed-position framing for ease of parsing.  Other options include
either an easy-to-parse byte count or well-known leader
text (a la MIME parts).  This makes it easier to parse without having to
pull in the entire message.
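

A minimal sketch of the byte-count option, assuming a made-up wire format
in which each message is preceded by its body length as ASCII decimal on a
line of its own.  The format and the function names are assumptions for
illustration only; neither MSRP nor the speechsc draft is being quoted.

```python
def frame(body: bytes) -> bytes:
    """Prefix a body with its length in bytes, followed by CRLF."""
    return str(len(body)).encode("ascii") + b"\r\n" + body


def split_frames(buffer: bytes):
    """Return (complete_bodies, leftover_bytes) for a received buffer."""
    bodies = []
    while True:
        eol = buffer.find(b"\r\n")
        if eol < 0:
            break  # the length line itself has not fully arrived
        length = int(buffer[:eol])
        start = eol + 2
        end = start + length
        if len(buffer) < end:
            break  # body incomplete; wait for more bytes
        # The body is sliced out by position and never scanned for headers.
        bodies.append(buffer[start:end])
        buffer = buffer[end:]
    return bodies, buffer


stream = frame(b"hello") + frame(b"a\r\nb") + b"3\r\nab"
bodies, leftover = split_frames(stream)
```

Because the count sits at a fixed position ahead of the body, the
receiver knows how many bytes to wait for before looking at any of the
message content.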


Protocol Analysis Document - Eric Burger
----------------------------------------


The document is complete, though it still needs some more work, 
particularly cross-review.  A show of hands showed that 3 or 4 people had 
read it.  So now what?  Does this document need to be published? Does it 
need to be kept alive for the duration of the protocol?  etc.


The AD felt that if the document was interesting or worthwhile, or could
convey some of the rationale for using IETF-supported protocols, it would
be useful to document.


There was some collective intuition that it would be good to have
documented the reasons why the group moved in the direction that it did,
particularly because the group has made a fairly significant change in
direction.  As of now, however, the document is not in a publishable
state, and needs further work.


Milestone Review - Eric Burger
------------------------------
The group is a little ahead of schedule on the milestones as far as draft 
submissions are concerned.  The milestones will be updated coming out of the 
Vienna meeting.

Slides

Speechsc Protocol Proposal

Presentation 1

SPEECHSC

Presentation 2