| Commit message (Collapse) | Author | Age |
The purpose of these ops is to fix a latency problem observed in an inference benchmark. An inference step often starts by reading the values of many (often hundreds of) weights. For a resource variable, each read requires a VarHandleOp and a ReadVariableOp. The inter-op latency of the executor can be hundreds of nanoseconds per op, so running hundreds of trivial ops can add hundreds of microseconds of latency to the critical path of an inference step.
This change introduces two fused ops, _VarHandlesOp and _ReadVariablesOp, that allow us to read many variables with a pair of larger ops rather than many tiny ops.
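As a rough back-of-envelope illustration of the reasoning above, the sketch below models only the executor's per-op dispatch overhead. The variable count (300) and the per-op latency (500 ns) are assumed figures chosen for illustration, not measurements from the benchmark, and `dispatch_overhead_us` is a hypothetical helper, not part of TensorFlow.

```python
def dispatch_overhead_us(num_ops, per_op_latency_ns):
    """Estimate total inter-op overhead in microseconds.

    Models only the fixed per-op executor latency; the work done
    inside each op is ignored.
    """
    return num_ops * per_op_latency_ns / 1000.0


# Assumed figures for illustration only.
num_variables = 300       # "hundreds" of weights
per_op_latency_ns = 500   # "hundreds of nanoseconds" per op

# Unfused: one VarHandleOp plus one ReadVariableOp per variable.
unfused_ops = 2 * num_variables
# Fused: one _VarHandlesOp plus one _ReadVariablesOp in total.
fused_ops = 2

unfused_us = dispatch_overhead_us(unfused_ops, per_op_latency_ns)
fused_us = dispatch_overhead_us(fused_ops, per_op_latency_ns)

print(f"unfused: {unfused_ops} ops, ~{unfused_us:.0f} us of dispatch overhead")
print(f"fused:   {fused_ops} ops, ~{fused_us:.0f} us of dispatch overhead")
```

Under these assumptions the unfused path spends about 300 us on dispatch alone, while the fused pair spends about 1 us. The actual saving depends on the real per-op latency and on how much work the fused ops do internally.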
PiperOrigin-RevId: 214662338
|
Before this change, we were not releasing device memory
allocated by ResourceVariables.
PiperOrigin-RevId: 204329027
|
Before this change, when we executed a naked variable read (i.e. outside of
a defun, directly running <xla_device>->Compute()), the tf2xla kernel would
copy the variable's tensor, leading to many unnecessary copies.
This change uses the regular non-tf2xla kernel for naked variable reads
and marks the tf2xla one as CompilationOnly().
PiperOrigin-RevId: 197976146
|