Wednesday, December 14, 2011

Troubleshooting 0x116 VIDEO_TDR_FAILURE

The Debugging Tools for Windows are required to analyze crash dump files. If you do not have the Debugging Tools for Windows installed or dump files are not being generated on system crash, see this post for installation/configuration instructions:

http://mikemstech.blogspot.com/2011/11/windows-crash-dump-analysis.html

0x00000116 VIDEO_TDR_FAILURE is an interesting blue screen of death (BSOD) because a lot of users encounter it, yet in the vast majority of forum posts (even ones that I've answered previously), neither MVPs (Microsoft Most Valuable Professionals) nor non-MVPs have been able to resolve the issue, or even to break from the "update your graphics drivers" or "update your BIOS" mantra. I decided to dig into the Windows Driver Kit to determine what this error is actually saying and to see whether there is a better resolution or workaround. Let's start by looking at a dump from May 2010 using the !analyze -v and !sysinfo machineid debugger commands:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: fffffa8004c50310, Optional pointer to internal TDR recovery context 
                        (TDR_RECOVERY_CONTEXT).
Arg2: fffff880101cc360, The pointer into responsible device driver module 
                        (e.g. owner tag).
Arg3: 0000000000000000, Optional error code (NTSTATUS) of the last failed 
                        operation.
Arg4: 0000000000000002, Optional internal context dependent data.

Debugging Details:
------------------


FAULTING_IP: 
nvlddmkm+114360
fffff880`101cc360 803d393cb80000  cmp     byte ptr 
                           [nvlddmkm+0xc97fa0 (fffff880`10d4ffa0)],0

DEFAULT_BUCKET_ID:  GRAPHICS_DRIVER_TDR_FAULT

CUSTOMER_CRASH_COUNT:  1

BUGCHECK_STR:  0x116

PROCESS_NAME:  System

CURRENT_IRQL:  0

STACK_TEXT:  
... : nt!KeBugCheckEx
... : dxgkrnl!TdrBugcheckOnTimeout+0xec
... : dxgkrnl!TdrIsRecoveryRequired+0x1a2
... : dxgmms1!VidSchiReportHwHang+0x40b
... : dxgmms1!VidSchiCheckHwProgress+0x71
... : dxgmms1!VidSchiWaitForSchedulerEvents+0x1fb
... : dxgmms1!VidSchiScheduleCommandToRun+0x1da
... : dxgmms1!VidSchiWorkerThread+0xba
... : nt!PspSystemThreadStartup+0x5a
... : nt!KxStartSystemThread+0x16


STACK_COMMAND:  .bugcheck ; kb

FOLLOWUP_IP: 
nvlddmkm+114360
fffff880`101cc360 803d393cb80000  cmp     byte ptr 
                        [nvlddmkm+0xc97fa0 (fffff880`10d4ffa0)],0

SYMBOL_NAME:  nvlddmkm+114360

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nvlddmkm

IMAGE_NAME:  nvlddmkm.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  4baa0110

FAILURE_BUCKET_ID:  X64_0x116_IMAGE_nvlddmkm.sys

BUCKET_ID:  X64_0x116_IMAGE_nvlddmkm.sys

Followup: MachineOwner
---------

0: kd> !sysinfo machineid
Machine ID Information [From Smbios 2.5, DMIVersion 0, Size=1323]
BiosVendor = Alienware
BiosVersion = A04
BiosReleaseDate = 04/28/2010
SystemManufacturer = Alienware
SystemProductName = M17x          
SystemVersion = A0423
BaseBoardManufacturer = Alienware
BaseBoardProduct =       
BaseBoardVersion = A04 
 
We can tell one thing: We had a crash (duh...). The BIOS is relatively new (for the time) and the NVIDIA drivers were likely up to date when this system crashed (we can use the lmvm nvlddmkm command to see this):

0: kd> lmvm nvlddmkm
start             end                 module name
fffff880`100b8000 fffff880`10de4d80   nvlddmkm T (no symbols)           
    Loaded symbol image file: nvlddmkm.sys
    Image path: \SystemRoot\system32\DRIVERS\nvlddmkm.sys
    Image name: nvlddmkm.sys
    Timestamp:        Wed Mar 24 06:09:52 2010 (4BAA0110)
    CheckSum:         00D3D0C3
    ImageSize:        00D2CD80
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
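
As a quick cross-check of the output above, here is a minimal Python sketch (not part of the original analysis) that decodes the image timestamp and confirms that Arg2 of the bug check points inside the nvlddmkm address range. The constants are copied from the debugger output; the function names are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timezone

# Values copied from the !analyze -v and lmvm output above.
MODULE_START = 0xFFFFF880100B8000   # lmvm "start"
MODULE_END = 0xFFFFF88010DE4D80     # lmvm "end"
ARG2 = 0xFFFFF880101CC360           # bug check Arg2
IMAGE_TIMESTAMP = 0x4BAA0110        # DEBUG_FLR_IMAGE_TIMESTAMP


def decode_timestamp(stamp: int) -> datetime:
    """Interpret an image timestamp as seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


def module_offset(address: int, start: int, end: int) -> int:
    """Return the offset of address within [start, end), or raise ValueError."""
    if not start <= address < end:
        raise ValueError("address is outside the module")
    return address - start


if __name__ == "__main__":
    print(decode_timestamp(IMAGE_TIMESTAMP).isoformat())
    print(hex(module_offset(ARG2, MODULE_START, MODULE_END)))
```

The offset comes out as 0x114360, which matches the nvlddmkm+114360 shown under FAULTING_IP. The decoded time is in UTC, so the hour will differ from a debugger line that is printed in a local time zone; either way, the driver was built in March 2010, roughly two months before the May 2010 dump.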
 
 
So now we need to really dig into the error and figure out what windbg/kd are trying to tell us about what happened. Microsoft's one-liner description is fairly cryptic: "Attempt to reset the display driver and recover from timeout failed." So what does this mean?

One of the complaints with Windows (or really any other operating system) is that the screen freezes from time to time. If the screen freezes for more than a few seconds, users are likely to hard reset the machine that they are working on. This seems natural, but in this case the rest of the system is still responsive. The graphics processing unit (GPU) is simply busy processing something (possibly a game, 3D render, or even Windows Aero) and is not actively refreshing the screen.

In Windows Vista SP1 and Windows Server 2008 SP1, Microsoft introduced a feature called "Timeout Detection and Recovery (TDR)" to help catch and correct this behavior. TDR works to identify whether the graphics processor is hung (the default timeout is 2 seconds), and if it is, it prepares to reset the graphics processor and the relevant part of the graphics stack. During this process, it tells the driver not to access the hardware or memory and gives currently running threads a short time to leave the driver. If the threads do not leave within that time, the system bug checks with 0x116 VIDEO_TDR_FAILURE. The system can also bug check with VIDEO_TDR_FAILURE if a number of TDR events occur in a short period of time (the default is 5 TDRs in 1 minute). If the recovery is successful, the user may see a notification bubble that says "Display driver stopped responding and has recovered."
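
The "5 TDRs in 1 minute" rule can be pictured as a sliding window over the times at which TDR events occur. The Python sketch below is only a simplified model of that idea, not how the Windows graphics kernel actually implements it. The function name is hypothetical, and whether the real threshold is "reaches 5" or "exceeds 5" is an assumption; this version flags the event that brings the count up to the limit.

```python
from collections import deque


def first_limit_breach(event_times, limit_count=5, limit_seconds=60):
    """Return the time (in seconds) of the first TDR event that brings the
    number of events inside a sliding window of limit_seconds up to
    limit_count, or None if that never happens.

    Simplified model: defaults mirror "5 TDRs in 1 minute".
    """
    window = deque()
    for t in sorted(event_times):
        window.append(t)
        # Forget events that are at least limit_seconds older than this one.
        while t - window[0] >= limit_seconds:
            window.popleft()
        if len(window) >= limit_count:
            return t
    return None


if __name__ == "__main__":
    # Five hangs within 40 seconds: the model reports a breach.
    print(first_limit_breach([0, 10, 20, 30, 40]))
    # One hang every 30 seconds: the model never reports a breach.
    print(first_limit_breach([0, 30, 60, 90, 120]))
```

In this model, a system that hangs occasionally keeps recovering, while a burst of hangs in quick succession ends in the bug check.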

There are some registry values that can be used to control the behavior of TDR and may be useful for additional testing and troubleshooting. These values all live under the key HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers. They may need to be created if they are not there; when a value is missing, the system uses its default. A full list is here on MSDN, but I will explain a couple that are likely to be the most applicable. The usual warnings apply with editing the registry: Be sure that you know what you are doing, and if you get into trouble, either restore from backup or try to back out the change in Safe Mode.

The first value determines whether TDR is enabled or disabled and what it actually does when it detects a timeout:

Value Name:       TdrLevel
Type:             REG_DWORD
Possible Values:  0 - Detection is Disabled
                  1 - Bug Check on Timeout
                  3 - Recover on Timeout (default)

This value may be useful to disable TDR. This would be done in the case that the graphics hardware and display driver simply do not play nicely with TDR and the GPU/driver would recover on their own. Keep in mind that with detection disabled, if the driver/GPU don't recover after a hang, the system will appear to be frozen and will not bug check on its own. This is the main registry value that I think might be helpful, but I'll also mention a couple of others.

TdrDelay (REG_DWORD, default value = 2) is used to change the timeout period from 2 seconds to a different number of seconds. This would be useful if, for example, the GPU takes 3 seconds to recover (instead of 2).

TdrLimitTime (REG_DWORD, default value = 60) and TdrLimitCount (REG_DWORD, default value = 5) change how many TDRs are allowed in a specific time period (by default, 5 TDRs in 60 seconds). The main usefulness here would be if the crash can be tuned out of the system by adjusting these parameters.
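
For testing, it can be handy to keep these changes in a .reg file that can be reviewed before it is imported. The Python sketch below only builds the text of such a file and never touches the registry itself. The function name and the defaults dictionary are hypothetical; the value names, key path, and default values are the ones described in this post.

```python
KEY = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers"

# Defaults as described in this post; used here only to validate names.
TDR_DEFAULTS = {
    "TdrLevel": 3,
    "TdrDelay": 2,
    "TdrLimitTime": 60,
    "TdrLimitCount": 5,
}


def build_reg_file(overrides):
    """Return the text of a .reg file that sets the given TDR DWORD values."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{KEY}]"]
    for name, value in overrides.items():
        if name not in TDR_DEFAULTS:
            raise KeyError(f"unknown TDR value: {name}")
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name} does not fit in a REG_DWORD")
        # .reg files store DWORDs as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Example: raise the timeout from 2 seconds to 3 seconds.
    print(build_reg_file({"TdrDelay": 3}))
```

Saving the printed text to a file with a .reg extension gives something that can be inspected and then imported deliberately, which fits the usual advice to back up before editing the registry.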

Other ideas for troubleshooting:
  • Disable any overclocking on the system or graphics processor
  • Ensure that your power supply is sufficient to handle the motherboard, processors, video card, and all other devices
  • Ensure that all power connections are firmly in place on the motherboard and video card (some video cards require additional power and have specific ports that need to receive direct power from the power supply)
  • Verify that the video card is fully inserted and secured
  • Verify that no other wires or materials are lying on the video card
  • Verify that the system and video card are adequately cooled; overheating graphics cards can cause serious hangs/crashes
  • Verify that DirectX and OpenGL are up to date and any graphics intensive applications (such as games) are fully patched

I wouldn't pin this problem on Microsoft. Ultimately, this crash is due to game/software developers and graphics card manufacturers (such as ATI/AMD and NVIDIA) developing buggy devices and software and not playing by the rules and standards dictated for a specific platform like Windows. There are many cases of similar events happening on UNIX/Linux systems, so this problem is not isolated to Windows.

See Also,
Windows Crash Dump Analysis
Stress Testing a Video Card



2 comments:

  1. Thanks for the tip. I have a brand new machine with new AMD video and am getting 116 and 117 BSODs occasionally.

    I'm going to disagree with you that the blame shouldn't be on Microsoft. There has to be a better way to handle this condition than a BSOD. Great way to lose data!

    1. Unfortunately, the way it works, the two options are for the system to crash or appear to be totally unresponsive...
