doc/bugs/Allow_automatic_retry_git_annex_get/comment_2_6d05cd09e1f00fb5ace2b9ae3bffdedb._comment


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66

[[!comment format=mdwn
 username="joey"
 subject="""comment 2"""
 date="2016-10-26T18:26:35Z"
 content="""
The most common way a network connection can stall like this is when
moving to a different wifi network: the connection is open but
no more data will be received. I suppose other kinds of network
glitches could also lead to this kind of situation.

ssh has some things, like ServerAliveInterval and TCPKeepAlive, 
that it can use to detect such problems. You may find them useful.

----

As for the retrying once a stall is detected, some transfers use
`forwardRetry` which will automatically retry as long as the failed try
managed to send some data. But the get/move/copy commands currently use
`noRetry`. I can't find any justification for not always using
`forwardRetry`; I think that it was added for the assistant originally and
the other stuff just never switched over.

Only problem I can think of is, if there actually is a ssh password 
prompt, it would prompt again on retry. But most people using git-annex
with ssh have something in place to make ssh not prompt repeatedly for
passwords.

So, I've gone ahead and enabled `forwardRetry` for everything.

----

Occurs to me that git-annex could try to notice when a transfer is not
progressing, by reusing the existing progress metering code.

Since some remotes don't update the progress meter, this could
only be used to detect stalls after the progress meter has been updated
at least once. If the stall occurs earlier than that, it would not be able
to be detected.

It seems quite hard to come up with a good timeout value to detect a
stalled connection. Often progress meters are updated after every small
(eg 32kb) chunk transferred. But others might poll periodically, or might
use a larger chunk size. It's even possible that some special remotes
are looking at a percent output by some program, and only update the meter
when the percent transferred changes -- in which case it could be many
minutes in between each meter update when a large file is being
transferred.

If the timeout is too short, git-annex will stall in a new way, by
constantly killing "stalled" connections before they can send enough data.

----

So it really seems better to fix the ssh connection to not stall, since
that is not so heuristic a fix. Seems like git-annex could force
ServerAliveInterval to be set, and perhaps lower ServerAliveCountMax from 3
to 1. The ssh BatchMode setting sets the former to 300, so a stalled
connection will time out after 15 minutes. But BatchMode also disables
prompting, and git-annex should not disable that.

Catch is, what if the user has configured ssh with some
other ServerAliveInterval value? We don't want git-annex to override that.

(git-annex does have a rudimentary .ssh/config parser, but it's not
good enough to handle eg, "Host *.example.com ")
"""]]