ACS Proposal - Device Internal Status Log

July 28, 2010
Revision 3

Author: Nathan Obr
1 Microsoft Way
Redmond, WA 98052
425-705-9157
NatObr@Microsoft.com
Document Status

Revision History

Rev   Date            Description
00    May 10, 2010    Investigation of Scope and requirements
0     May 27, 2010    First Draft
1     June 08, 2010   Incorporated Ad Hoc meeting feedback
2     June 23, 2010   Incorporated feedback on the r1 revision from the June T13 meeting
3     July 28, 2010   Incorporated Ad Hoc meetings feedback
1 Introduction

Storage is a critical component of computer systems, and storage failures are among the most expensive failures to address. This is particularly true for enterprise IT. Storage devices are sophisticated systems with firmware codebases exceeding 0.5M lines of code, highly tuned electromagnetic media, and constantly growing capacities. Greater complexity leads to greater difficulty in diagnosing failures, as well as greater opportunities derived from detailed device logging.

One approach to addressing increasing system complexity is enabling a telemetry feedback loop that reports both health monitoring and component failure data for analysis. This benefits users via continuous improvement in manufacturing components as well as software behavior. For high volume manufactured components, telemetry data collection should be thought of as part of the standard engineering process.

2 Scope

The purpose of this proposal is to identify and catalog the current goals and requirements for the inclusion of a telemetry mechanism into ATA8-ACS. All section changes are made with reference to d2015r3-ATAATAPI_Command_Set_-_2_ACS-2. This proposal establishes the scope and dataflow that the targeted telemetry system must provide, and it proposes a mechanism for notifying and collecting telemetry data from the device at critical times.

3 Overview

Enabling telemetry systems creates an opportunity to significantly improve storage experience and perceived reliability for practically all computer applications by enabling device vendors to get internal disk logs through any telemetry system. Creating a closed feedback loop for storage devices in the field will lead to better devices. Continuous connection of vendors to their users' experiences can lead to improved quality of the firmware and components.

Past experience has indicated that some degree of flexibility in size is desired in the collection of telemetry payloads. Issues discovered through telemetry tend to go through 3 phases.
Phase 1 establishes that an issue exists in the field and is best accomplished by collecting a minimum set of data to identify the issue as being distinct from other issues. Once the number of instances of an issue grows large enough to prioritize investigation, another phase may be necessary. In Phase 2, a targeted collection of more in-depth, medium size payloads from the field is gathered and analyzed in order to identify the source of the problem. Finally, for those rare few issues that cannot be diagnosed by medium size payloads, a heroic phase may be justified. In Phase 3, one or two large size payloads are collected and analyzed in order to diagnose the issue.

An important aspect of issue discovery through telemetry is the ability to qualify the distinct issues that are being collected. The ability to create a one to one mapping of issues to payload collections is essential. If a one to one mapping cannot be created, there is the risk that several payload collections appear distinct but are actually all caused by the same issue. Conversely, a single payload collection may have payloads caused by several issues mixed together, making it difficult to understand what the cause of the problem is.

The remainder of this section provides a high level description of these aspects of this feature as well as its other uses through timing and dataflow, scope of requirements and goals, and terms defined as necessary.

3.1 Proposed Definition of Terms for Solution

Device Internal Status Command: ATA method including command code and parameters that completely defines a single transfer of device internal status data.

Device Internal Status Data Header: The standard public portion of the device internal status data returned by the device in response to the Device Internal Status Command.

Device Internal Status Data: The vendor specific private portion of the device internal status data returned by the device in response to the Device Internal Status Command.

Device Internal Status Data Type: One of several amounts of device internal status data returned in response to the Device Internal Status Command.

3.2 Device Internal Status Data Collection Requirements

Support for the Device Internal Status Command, along with the Device Internal Status Data Types and their sizes, is discoverable.

The Device Internal Status Command includes the size of data to be transferred. The data size applies to both the Device Internal Status Data Header and the Device Internal Status Data.

The device and host both know the amount of time the device is given before the command will be aborted by the host.

The Device Internal Status Data includes a Device Internal Status Data Header which will include:
    Routing information that can be mapped 1:1 to an organization which should receive the data
    Routing information that correlates the issue encountered with the proper remedial action to apply to the failing system/device by the host
    A device identification tag that can be used to associate a specific device to manufacturing data for that specific device
    Information identifying the host or device as the trigger of the collection of the device internal status data

The device is able to notify the host of the presence of existing device internal status data and its corresponding size during enumeration in a manner that is not dependent on other commands being processed.

3.3 Device Internal Status Data Collection Process Flow

3.3.1 Scenario for Current Device Internal Status Data

1. Host determines GPL support in IDENTIFY DEVICE DATA
2. Host determines Device Internal Status Command support and Device Internal Status Data Type sizes from reading GPL address 0, page 0.
3. During a system crash, the host triggers a Device Internal Status Data collection by reading the appropriate Device Internal Status Data pages.
   i. The device completes the command quickly (e.g. in less than 1 second).
It is recommended that a device respond quickly to avoid users rebooting the system and causing a loss of device state.

3.3.2 Scenario for Saved Device Internal Status Data

1. Host determines GPL support in IDENTIFY DEVICE DATA
2. Host determines Device Internal Status Command support and Device Internal Status Data Type sizes from reading GPL address 0, page 0.
3. During enumeration, the host determines if there is Device Internal Status Data available by reading page 0. If so, the host collects the Device Internal Status Data by reading the corresponding Device Internal Status Data pages.
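The saved-data scenario above can be sketched from the host side as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `ReadPage` is a hypothetical transport hook standing in for a GPL read of one log page, `SAVED_LOG_ADDRESS` is a placeholder because the proposal leaves the address as TBD1, and `make_fake_device` is an in-memory stand-in used only to make the sketch runnable. The Saved Data Available byte at offset 382 is taken from the header definition in A.TBD-B.2. The number of data pages is passed in because the host is expected to have learned it in step 2.

```python
from typing import Callable, List

PAGE_SIZE = 512
# Hypothetical placeholder: the proposal leaves the saved-log address as TBD1.
SAVED_LOG_ADDRESS = 0x25
# Byte 382 of the header page holds Saved Data Available (see A.TBD-B.2.7).
SAVED_DATA_AVAILABLE_OFFSET = 382

# Hypothetical transport hook: (log_address, page_number) -> one 512-byte page.
ReadPage = Callable[[int, int], bytes]


def collect_saved_data(read_page: ReadPage, data_pages: int) -> List[bytes]:
    """Return saved data pages 1..data_pages, or [] when none is available."""
    header = read_page(SAVED_LOG_ADDRESS, 0)
    if len(header) != PAGE_SIZE:
        raise ValueError("header page must be 512 bytes")
    if header[SAVED_DATA_AVAILABLE_OFFSET] != 0x01:
        return []
    return [read_page(SAVED_LOG_ADDRESS, n) for n in range(1, data_pages + 1)]


def make_fake_device(available: bool, data_pages: int) -> ReadPage:
    """Build an in-memory stand-in for a device, for illustration only."""
    header = bytearray(PAGE_SIZE)
    header[SAVED_DATA_AVAILABLE_OFFSET] = 0x01 if available else 0x00
    pages = [bytes(header)]
    pages += [bytes([n]) * PAGE_SIZE for n in range(1, data_pages + 1)]

    def read_page(log_address: int, page_number: int) -> bytes:
        if log_address != SAVED_LOG_ADDRESS:
            raise ValueError("unsupported log address")
        return pages[page_number]

    return read_page


if __name__ == "__main__":
    print(len(collect_saved_data(make_fake_device(True, 3), 3)))   # prints 3
    print(len(collect_saved_data(make_fake_device(False, 3), 3)))  # prints 0
```

Checking the header before reading any data page keeps enumeration cheap when no saved data exists, which matches the intent of step 3.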
4 Changes to ACS

4.1 Changes to Annex A - Log Definitions

4.1.1 Changes to the Log Address Definitions Table

Table A.2 Log Address Definitions

Log Address   Log Name                                              Feature Set  R/W       Access
00h           Log directory, see A.2 and A.3                        N/A          RO        GPL,SL
01h           Summary SMART Error log, see A.17                     SMART        RO        SL
02h           Comprehensive SMART Error log, see A.4                SMART        RO        SL
03h           Extended Comprehensive SMART Error Log, see A.7       SMART        RO        GPL
04h           Device Statistics, see A.5                            N/A          RO        GPL,SL
05h           Reserved for the CompactFlash Association.
06h           SMART Self-Test log, see A.16                         SMART        RO        SL
07h           Extended SMART Self-Test log, see A.9                 SMART        RO        GPL
08h           Power Conditions, see A.8                             EPC          RO        GPL
09h           Selective Self-Test log, see A.15                     SMART        R/W       SL
0Ah-0Ch       Reserved                                              N/A          Reserved
0Dh           LPS Mis-alignment log, see A.11                       LPS          RO        GPL,SL
0Eh-0Fh       Reserved
10h           NCQ Command Error log, see A.12                       NCQ          RO        GPL
11h           SATA Phy Event Counters log, see A.14                 N/A          RO        GPL
12h-17h       Reserved for Serial ATA                               N/A          Reserved
18h-1Fh       Reserved                                              N/A          Reserved
20h           Obsolete
21h           Write Stream Error log, see A.18                      Streaming    RO        GPL
22h           Read Stream Error log, see A.13                       Streaming    RO        GPL
23h           Obsolete
TBD0          Current Device Internal Status Data Log, see A.TBD-A  N/A          RO        GPL
TBD1          Saved Device Internal Status Data Log, see A.TBD-B    N/A          RO        GPL
(TBD1+1)-7Fh  Reserved                                              N/A          Reserved
80h-9Fh       Host Specific, see A.10                               SMART        R/W       GPL,SL
A0h-DFh       Device Vendor Specific, see A.6                       SMART        VS        GPL,SL
E0h           SCT Command/Status, see 8.1                           SCT          R/W       GPL,SL
E1h           SCT Data Transfer, see 8.1                            SCT          R/W       GPL,SL
E2h-FFh       Reserved                                              N/A

Key -
RO - Log is read only.
R/W - Log is read or written.
VS - Log is vendor specific thus read/write ability is vendor specific.
GPL - General Purpose Logging
SL - SMART Logging

Note 1 - The device shall return command aborted if a GPL feature set command accesses a log that is marked only with SL.
Note 2 - The device shall return command aborted if a SMART feature set command accesses a log that is marked only with GPL.
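Notes 1 and 2 amount to a simple membership check against the Access column. The sketch below illustrates that check for a handful of rows copied from Table A.2; the function name `command_aborted` and the string keys are hypothetical, and the two new logs keep their symbolic TBD0 and TBD1 addresses. Logs with no access marking (reserved or obsolete rows) are deliberately left out, since the notes do not cover them.

```python
# Access marks copied from a few rows of Table A.2 (illustrative subset only).
LOG_ACCESS = {
    "00h": {"GPL", "SL"},
    "01h": {"SL"},
    "03h": {"GPL"},
    "TBD0": {"GPL"},  # Current Device Internal Status Data Log
    "TBD1": {"GPL"},  # Saved Device Internal Status Data Log
}


def command_aborted(log_address: str, feature_set: str) -> bool:
    """Apply Notes 1 and 2: abort when the log is not marked for this feature set.

    feature_set is "GPL" for a General Purpose Logging command or "SL" for a
    SMART Logging command.
    """
    if feature_set not in ("GPL", "SL"):
        raise ValueError("feature_set must be 'GPL' or 'SL'")
    return feature_set not in LOG_ACCESS[log_address]


if __name__ == "__main__":
    # The new logs are marked GPL only, so a SMART logging access is aborted.
    print(command_aborted("TBD0", "SL"))   # prints True
    print(command_aborted("TBD0", "GPL"))  # prints False
```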
4.1.2 New Annex A sub clause TBD-A

A.TBD-A Current Device Internal Status Data Log (Log Address TBD0)

A.TBD-A.1 Overview

The Current Device Internal Status Data Log consists of the Current Device Internal Status Data Header page (see A.TBD-A.2) and zero or more Current Device Internal Status Data pages (see A.TBD-A.3).

The current device internal status data is the data representing the internal state of the device at the time the Current Device Internal Status Data log was read with bit 0 in the Feature field set to one. The current device internal status data may be retrieved by one or more reads of pages within the range of 0..n.

After a reset, the contents of the Current Device Internal Status Data log are not defined until the Current Device Internal Status Data log has been read with bit 0 in the Feature field set to one.

A.TBD-A.2 Current Device Internal Status Data Header Page

A.TBD-A.2.1 Current Device Internal Status Data Header Page Overview

The Current Device Internal Status Data Header is described in Table A.TBD2.

Table A.TBD2 Current Device Internal Status Data Header (page 0)

Offset    Type    Description
0         Byte    Log Address
1-3       Bytes   Reserved
4-7       DWord   Organization Identifier
                    Bit     Description
                    31:24   Reserved
                    23:0    IEEE OUI
8-9       Word    Device Internal Status Data Type 1 Length
10-11     Word    Device Internal Status Data Type 2 Length
12-13     Word    Device Internal Status Data Type 3 Length
14-381    Bytes   Reserved
382       Byte    Saved Data Available
383       Byte    Saved Data Generation Number
384-511   Bytes   Reason Identifier

A.TBD-A.2.2 Log Address

The Log Address field shall be set to TBD0.

A.TBD-A.2.3 Organization Identifier

The Organization Identifier field shall contain an Organization Unique Identifier (OUI) of the organization which is able to interpret the Current Device Internal Status Data in this Log.
A.TBD-A.2.4 Device Internal Status Data Type 1 Length

The Device Internal Status Data Type 1 Length field identifies the number of pages available within the Device Internal Status Data pages when the device produces Device Internal Status Data of type 1. The value in the Device Internal Status Data Type 1 Length field shall be less than or equal to the value in the Device Internal Status Data Type 2 Length field.
A.TBD-A.2.5 Device Internal Status Data Type 2 Length

The Device Internal Status Data Type 2 Length field identifies the number of pages available within the Device Internal Status Data pages when the device produces Device Internal Status Data of type 2. The value in the Device Internal Status Data Type 2 Length field shall be less than or equal to the value in the Device Internal Status Data Type 3 Length field.

A.TBD-A.2.6 Device Internal Status Data Type 3 Length

The Device Internal Status Data Type 3 Length field identifies the number of pages available within the Device Internal Status Data pages when the device produces Device Internal Status Data of type 3.

A.TBD-A.2.7 Saved Data Available

If the Saved Device Internal Status Data log is supported, then the Saved Data Available field shall contain the value of the Saved Data Available field in the Saved Device Internal Status Data log (see A.TBD-B.2.7). If the Saved Device Internal Status Data log is not supported, then the Saved Data Available field shall be reserved.

A.TBD-A.2.8 Saved Data Generation Number

If the Saved Device Internal Status Data log is supported, then the Generation Number field shall contain the value of the Generation Number field in the Saved Device Internal Status Data log (see A.TBD-B.2.8). If the Saved Device Internal Status Data log is not supported, then the Generation Number field shall be reserved.

A.TBD-A.2.9 Reason Identifier

The Reason Identifier field contains a vendor specific identifier that describes the operating conditions of the device at the time of capture. The Reason Identifier should provide an identification of different unique operating conditions of the device.

A.TBD-A.3 Current Device Internal Status Data Pages

The Current Device Internal Status Data in the subsequent pages (see table B.TBD3) of this log shall represent the device internal state.
Table B.TBD3 Current Device Internal Status Data (pages 1..n)

Offset    Type    Description
0-511     Byte    Vendor Specific
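The header layout in Table A.TBD2 can be decoded with a few fixed-offset reads. The following is a minimal sketch, not a definitive implementation: the names `CurrentStatusHeader` and `parse_current_header` are hypothetical, little-endian byte order for the Word and DWord fields is an assumption, and the sample log address and OUI in the usage section are arbitrary stand-ins. The ordering check reflects the requirement that the type 1 length not exceed the type 2 length, and the type 2 length not exceed the type 3 length.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

PAGE_SIZE = 512


@dataclass
class CurrentStatusHeader:
    log_address: int
    ieee_oui: int
    type_lengths: Tuple[int, int, int]  # pages for data types 1, 2 and 3
    saved_data_available: int
    saved_data_generation_number: int
    reason_identifier: bytes


def parse_current_header(page: bytes) -> CurrentStatusHeader:
    """Decode page 0 using the offsets from Table A.TBD2."""
    if len(page) != PAGE_SIZE:
        raise ValueError("header page must be 512 bytes")
    # Assumption: multi-byte fields are stored little-endian.
    (organization,) = struct.unpack_from("<I", page, 4)
    t1, t2, t3 = struct.unpack_from("<3H", page, 8)
    if not t1 <= t2 <= t3:
        raise ValueError("type lengths must satisfy type 1 <= type 2 <= type 3")
    return CurrentStatusHeader(
        log_address=page[0],
        ieee_oui=organization & 0xFFFFFF,  # bits 23:0; bits 31:24 are reserved
        type_lengths=(t1, t2, t3),
        saved_data_available=page[382],
        saved_data_generation_number=page[383],
        reason_identifier=bytes(page[384:512]),
    )


if __name__ == "__main__":
    sample = bytearray(PAGE_SIZE)
    sample[0] = 0x24  # hypothetical stand-in for TBD0
    struct.pack_into("<I", sample, 4, 0x123456)  # arbitrary example OUI
    struct.pack_into("<3H", sample, 8, 1, 8, 64)
    header = parse_current_header(bytes(sample))
    print(hex(header.ieee_oui), header.type_lengths)
```

A host could use `type_lengths` to pick how many data pages to read for the small, medium, or large payload it wants to collect.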
4.1.3 New Annex A sub clause TBD-B

A.TBD-B Saved Device Internal Status Data Log (Log Address TBD1)

A.TBD-B.1 Overview

The Saved Device Internal Status Data Log consists of a Saved Device Internal Status Data Header page (see A.TBD-B.2) and zero or more Saved Device Internal Status Data pages (see A.TBD-B.3).

The saved device internal status data in the Saved Device Internal Status Data log is a device initiated capture of the device internal state. The contents of the Saved Device Internal Status Data log shall persist across all resets.

A.TBD-B.2 Saved Device Internal Status Data Header Page

A.TBD-B.2.1 Saved Device Internal Status Data Header Page Overview

The Saved Device Internal Status Data Header is described in Table B.TBD4.

Table B.TBD4 Saved Device Internal Status Data Header (page 0)

Offset    Type    Description
0         Byte    Log Address
1-3       Bytes   Reserved
4-7       DWord   Organization Identifier
                    Bit     Description
                    31:24   Reserved
                    23:0    IEEE OUI
10-11     Bytes   Device Internal Status Data Type 1 Length
12-13     Bytes   Device Internal Status Data Type 2 Length
14-15     Bytes   Device Internal Status Data Type 3 Length
16-381    Bytes   Reserved
382       Byte    Saved Data Available
383       Byte    Saved Data Generation Number
384-511   Bytes   Reason Identifier

A.TBD-B.2.2 Log Address

The Log Address field shall be set to TBD1.

A.TBD-B.2.3 Organization Identifier

See A.TBD-A.2.3

A.TBD-B.2.4 Device Internal Status Data Type 1 Length

See A.TBD-A.2.4

A.TBD-B.2.5 Device Internal Status Data Type 2 Length

See A.TBD-A.2.5

A.TBD-B.2.6 Device Internal Status Data Type 3 Length

See A.TBD-A.2.6
A.TBD-B.2.7 Saved Data Available

If the Saved Data Available field is cleared to 00h, then the Saved Device Internal Status Data log does not contain saved Device Internal Status Data. If the Saved Data Available field is set to 01h, then the Saved Device Internal Status Data log contains Saved Device Internal Status Data.

If any page of the Saved Device Internal Status Data in the Saved Device Internal Status Data log is read, then the Saved Data Available field shall be cleared to 00h. If the device saves Saved Device Internal Status Data in the Saved Device Internal Status Data log, then the Saved Data Available field shall be set to 01h.

A.TBD-B.2.8 Saved Data Generation Number

The Generation Number field shall contain a value that is incremented when the device initiates a capture of its internal device state into the Saved Device Internal Status Data.

A.TBD-B.2.9 Reason Identifier

See A.TBD-A.2.9

A.TBD-B.3 Saved Device Internal Status Data Pages

The Saved Device Internal Status Data in the subsequent pages (see table B.TBD5) of this log shall represent the device internal state.

Table B.TBD5 Saved Device Internal Status Data (pages 1..n)

Offset    Type    Description
0-511     Byte    Vendor Specific
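The interaction between Saved Data Available and the Generation Number can be illustrated with a small device-side state model. This is a sketch under stated assumptions rather than a description of any real firmware: the class `SavedStatusLogModel` and its method names are hypothetical, the one-byte Generation Number is assumed to wrap on overflow (the proposal does not say), and only reads of data pages 1..n are treated as clearing the flag, which is one reading of A.TBD-B.2.7.

```python
from typing import List


class SavedStatusLogModel:
    """Toy device-side model of the A.TBD-B.2.7 and A.TBD-B.2.8 rules."""

    def __init__(self) -> None:
        self.saved_data_available = 0x00
        self.generation_number = 0
        self._data_pages: List[bytes] = []

    def device_capture(self, data_pages: List[bytes]) -> None:
        """Device-initiated capture: store pages, set flag, bump generation."""
        self._data_pages = list(data_pages)
        self.saved_data_available = 0x01
        # Assumption: the one-byte field wraps; the proposal does not say.
        self.generation_number = (self.generation_number + 1) & 0xFF

    def read_data_page(self, page_number: int) -> bytes:
        """Read data page 1..n; reading any of them clears the flag to 00h."""
        if not 1 <= page_number <= len(self._data_pages):
            raise IndexError("no such saved data page")
        self.saved_data_available = 0x00
        return self._data_pages[page_number - 1]


if __name__ == "__main__":
    log = SavedStatusLogModel()
    log.device_capture([b"\x00" * 512, b"\x01" * 512])
    print(log.saved_data_available, log.generation_number)  # prints 1 1
    log.read_data_page(1)
    print(log.saved_data_available, log.generation_number)  # prints 0 1
```

Because the flag is cleared by a read while the Generation Number is not, a host that compares generation numbers across enumerations can still tell whether a new capture has occurred since it last collected the log.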