Diffstat (limited to 'doc/design/assistant/syncing')
-rw-r--r-- doc/design/assistant/syncing/comment_1_c70156174ff19b503978d623bd2df36f._comment | 19
-rw-r--r-- doc/design/assistant/syncing/comment_2_eb992b5b2c7a5ce23443e2a6007e5ff9._comment | 8
-rw-r--r-- doc/design/assistant/syncing/comment_3_e1b5e8a24556de16d1cacd27ee0c1bd1._comment | 80
-rw-r--r-- doc/design/assistant/syncing/efficiency.mdwn | 73
4 files changed, 180 insertions, 0 deletions
diff --git a/doc/design/assistant/syncing/comment_1_c70156174ff19b503978d623bd2df36f._comment b/doc/design/assistant/syncing/comment_1_c70156174ff19b503978d623bd2df36f._comment
new file mode 100644
index 000000000..019490e61
--- /dev/null
+++ b/doc/design/assistant/syncing/comment_1_c70156174ff19b503978d623bd2df36f._comment
@@ -0,0 +1,19 @@
+[[!comment format=mdwn
+ username="https://www.google.com/accounts/o8/id?id=AItOawk4YX0PWICfWGRLuncCPufMPDctT7KAYJA"
+ nickname="betabrain"
+ subject="selective data syncing"
+ date="2012-07-24T15:27:08Z"
+ content="""
+How will the assistant know which files' data to distribute between the repos?
+
+I'm using git-annex and its numcopies attribute to maintain a redundant archive spread over different computers and USB drives. Not all drives should get a copy of everything, e.g. the USB drive I take to work should not automatically get a copy of family pictures.
+
+How about .gitattributes?
+
+* \* annex.auto-sync-data = false # don't automatically sync the data
+* archive/ annex.auto-push-repos = NAS # everything added to archive/ in any repo goes automatically to the NAS remote.
+* work/ annex.auto-synced-repos = LAPTOP WORKUSB # everything added to work/ in LAPTOP or WORKUSB gets synced to WORKUSB and LAPTOP
+* work/ annex.auto-push-repos = LAPTOP WORKUSB # stuff added to work/ anywhere gets synced to LAPTOP and WORKUSB
+* important/ annex.auto-sync-data = true # push data to all repos
+* webserver_logs/ annex.remote.WEBSERVER.auto-push-repos = S3 # only the assistant running in WEBSERVER pushes webserver_logs/ to S3 remote
+"""]]
diff --git a/doc/design/assistant/syncing/comment_2_eb992b5b2c7a5ce23443e2a6007e5ff9._comment b/doc/design/assistant/syncing/comment_2_eb992b5b2c7a5ce23443e2a6007e5ff9._comment
new file mode 100644
index 000000000..a4609d7e1
--- /dev/null
+++ b/doc/design/assistant/syncing/comment_2_eb992b5b2c7a5ce23443e2a6007e5ff9._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="https://www.google.com/accounts/o8/id?id=AItOawnBl7cA6wLDxVNUyLIHvAyCkf8ir3alYpk"
+ nickname="Tyson"
+ subject="Bridging LANs"
+ date="2012-07-10T10:20:59Z"
+ content="""
+Why rely on the cloud when you can instead use XMPP and Jingle to perform NAT traversal for you? AFAICT, it also means that traffic won't leave your router if the two endpoints are behind the same router.
+"""]]
diff --git a/doc/design/assistant/syncing/comment_3_e1b5e8a24556de16d1cacd27ee0c1bd1._comment b/doc/design/assistant/syncing/comment_3_e1b5e8a24556de16d1cacd27ee0c1bd1._comment
new file mode 100644
index 000000000..c9118595c
--- /dev/null
+++ b/doc/design/assistant/syncing/comment_3_e1b5e8a24556de16d1cacd27ee0c1bd1._comment
@@ -0,0 +1,80 @@
+[[!comment format=mdwn
+ username="https://www.google.com/accounts/o8/id?id=AItOawkSq2FDpK2n66QRUxtqqdbyDuwgbQmUWus"
+ nickname="Jimmy"
+ subject="comment 1"
+ date="2012-07-03T08:26:43Z"
+ content="""
+On \"git syncing\" point number 9, on OSX you could potentially do this on a semi-regular basis
+
+<pre>
+system_profiler SPNetworkVolumeDataType
+Volumes:
+
+ net:
+
+ Type: autofs
+ Mount Point: /net
+ Mounted From: map -hosts
+ Automounted: Yes
+
+ home:
+
+ Type: autofs
+ Mount Point: /home
+ Mounted From: map auto_home
+ Automounted: Yes
+</pre>
+
+and
+
+<pre>
+x00:~ jtang$ system_profiler SPUSBDataType
+USB:
+
+ USB High-Speed Bus:
+
+ Host Controller Location: Built-in USB
+ Host Controller Driver: AppleUSBEHCI
+ PCI Device ID: 0x0aa9
+ PCI Revision ID: 0x00b1
+ PCI Vendor ID: 0x10de
+ Bus Number: 0x26
+
+ Hub:
+
+ Product ID: 0x2504
+ Vendor ID: 0x0424 (SMSC)
+ Version: 0.01
+ Speed: Up to 480 Mb/sec
+ Location ID: 0x26200000 / 3
+ Current Available (mA): 500
+ Current Required (mA): 2
+
+ USB to ATA/ATAPI Bridge:
+
+ Capacity: 750.16 GB (750,156,374,016 bytes)
+ Removable Media: Yes
+ Detachable Drive: Yes
+ BSD Name: disk1
+ Product ID: 0x2338
+ Vendor ID: 0x152d (JMicron Technology Corp.)
+ Version: 1.00
+ Serial Number: 313541813001
+ Speed: Up to 480 Mb/sec
+ Manufacturer: JMicron
+ Location ID: 0x26240000 / 5
+ Current Available (mA): 500
+ Current Required (mA): 2
+ Partition Map Type: MBR (Master Boot Record)
+ S.M.A.R.T. status: Not Supported
+ Volumes:
+ Porta-Disk:
+ Capacity: 750.16 GB (750,156,341,760 bytes)
+ Available: 668.42 GB (668,424,208,384 bytes)
+ Writable: Yes
+ File System: ExFAT
+....
+</pre>
+
+I think it's possible to programmatically get this information either from the CLI (it dumps out XML output if required) or from some development library. There is also DBUS in macports, but I have never had much interaction with it, so I don't know if it's good or bad on OSX.
+"""]]
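The comment above notes that `system_profiler` can emit XML. As a rough sketch of how that output might be consumed, the snippet below parses a property list and collects every entry that carries a BSD device name. The key names `_items`, `_name`, and `bsd_name`, and the helper `find_usb_volumes`, are assumptions made for illustration; the sample data is synthetic and was not captured from a real machine.

```python
import plistlib


def find_usb_volumes(plist_bytes):
    """Return (device name, BSD name) pairs for every dict with a 'bsd_name' key.

    The key names are assumptions about the plist layout, not a documented API.
    """
    found = []

    def walk(node):
        # Recurse through nested dicts and lists, whatever the nesting depth.
        if isinstance(node, dict):
            if "bsd_name" in node:
                found.append((node.get("_name", "?"), node["bsd_name"]))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(plistlib.loads(plist_bytes))
    return found


# Synthetic stand-in for the XML output, loosely modelled on the listing above.
sample = plistlib.dumps([
    {"_items": [
        {"_name": "USB High-Speed Bus",
         "_items": [
             {"_name": "USB to ATA/ATAPI Bridge",
              "bsd_name": "disk1",
              "removable_media": "yes"},
         ]},
    ]},
])

volumes = find_usb_volumes(sample)
```

On a Mac, the bytes could presumably come from running `system_profiler -xml SPUSBDataType` and reading its standard output; polling that periodically and comparing results would be one way to notice a newly attached drive.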
diff --git a/doc/design/assistant/syncing/efficiency.mdwn b/doc/design/assistant/syncing/efficiency.mdwn
new file mode 100644
index 000000000..7da721a2c
--- /dev/null
+++ b/doc/design/assistant/syncing/efficiency.mdwn
@@ -0,0 +1,73 @@
+Currently, the git-annex assistant syncs with remotes in a way that is
+dumb, and potentially inefficient:
+
+1. Files are transferred to each reachable remote whose
+ [[preferred_content]] setting indicates it wants the file.
+
+2. After each file transfer (upload or download), a git sync
+ is done to all the remotes, to update location log information.
+
+## unnecessary transfers
+
+There are network topologies where #1 is massively inefficient.
+For example:
+
+<pre>
+ laptopA-----laptopB-----laptopC
+ \ | /
+ \---cloud based repo--/
+</pre>
+
+When laptopA has a new file, it will first send it to laptopB. It will then
+check if the cloud based transfer repository wants a copy. It will, because
+laptopC has not yet gotten a copy. So laptopA will proceed with a slow
+upload to the cloud, while meanwhile laptopB is sending the file over fast
+LAN to laptopC.
+
+(The more common case with no laptopC happens to work efficiently.
+So does the case where laptopA is paired with laptopC.)
+
+## unnecessary syncing
+
+Less importantly, the constant git syncing after each transfer is rather a
+lot of work, and prevents collecting multiple presence changes to the git-annex
+branch into larger commits, which would save disk space over time.
+
+In many cases, this sync is necessary. For example, when a file is uploaded
+to a transfer remote, the location change needs to be synced out so that
+other clients know to grab it.
+
+Or, when downloading a file from a drive, the sync lets other locally
+paired repositories know we got it, so they can download it from us.
+OTOH, this is also a case where a sync is sometimes unnecessary, since
+if we're going to upload the file to them after getting it, the sync
+only perhaps lets them start downloading it before our transfer queue
+reaches a point where we'd upload it.
+
+It would be good to find a way to detect when syncing is not immediately
+necessary, and defer it.
+
+## mapping
+
+Mapping the repository network has the potential to get git-annex the
+information it needs to avoid unnecessary transfers and/or unnecessary
+syncing.
+
+Mapping the network can reuse code in `git annex map`. Once the map is
+built, we want to find paths through the network that reach all nodes
+eventually, with the least cost. This is a minimum spanning tree problem,
+except with a directed graph, so really a minimum-cost arborescence problem.
+
+A significant problem in mapping is that nodes are mobile; they can move
+between networks over time. This breaks LAN based paths through the
+network. Mapping would need a way to detect this. Note that individual
+git-annex assistants can tell when they've switched networks by using the
+`networkConnectedNotifier`.
+
+## P2P signaling
+
+Another approach that might help with these problems is if git-annex
+repositories have a non-git, out-of-band signaling mechanism. This could,
+for example, be used by laptopB to tell laptopA that it's trying to send
+a file directly to laptopC. laptopA could then defer the upload to the
+cloud for a while.
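
The mapping section of efficiency.mdwn frames route selection as an arborescence problem. As a minimal sketch of that idea, the code below finds a minimum-cost arborescence by exhaustive search over parent choices. This is a brute-force stand-in suitable only for tiny networks, not Edmonds' algorithm, and not how git-annex does or would do it. The node names follow the laptopA/laptopB/laptopC diagram; the costs (1 for a LAN link, 10 for a cloud link) and the helper names are assumptions chosen for illustration.

```python
from itertools import product


def reaches_root(node, parent, root):
    """Follow parent links from node; True if they end at root without a cycle."""
    seen = set()
    while node != root:
        if node in seen:
            return False
        seen.add(node)
        node = parent[node]
    return True


def min_arborescence(nodes, edges, root):
    """Brute-force minimum-cost arborescence rooted at `root`.

    `edges` maps (src, dst) -> cost. Returns (total_cost, {node: parent}),
    or None if some node cannot be reached from the root.
    """
    others = [n for n in nodes if n != root]
    candidates = []
    for n in others:
        # Every node with an edge into n is a possible parent for n.
        parents = [s for (s, d) in edges if d == n and s != n]
        if not parents:
            return None
        candidates.append(parents)

    best = None
    for choice in product(*candidates):
        parent = dict(zip(others, choice))
        # Keep only assignments where every node's parent chain ends at the root.
        if not all(reaches_root(n, parent, root) for n in others):
            continue
        cost = sum(edges[(p, n)] for n, p in parent.items())
        if best is None or cost < best[0]:
            best = (cost, parent)
    return best


def link(edges, a, b, cost):
    """Add a link usable in both directions at the same (assumed) cost."""
    edges[(a, b)] = cost
    edges[(b, a)] = cost


nodes = ["laptopA", "laptopB", "laptopC", "cloud"]
edges = {}
link(edges, "laptopA", "laptopB", 1)   # fast LAN
link(edges, "laptopB", "laptopC", 1)   # fast LAN
for laptop in ("laptopA", "laptopB", "laptopC"):
    link(edges, laptop, "cloud", 10)   # slow upload/download

cost, parent = min_arborescence(nodes, edges, "laptopA")
```

With these assumed costs, the cheapest tree sends the file from laptopA to laptopB and from laptopB to laptopC over the LAN, with a single upload to the cloud from one of the laptops. That matches the behaviour the document wants, as opposed to laptopA uploading to the cloud so that laptopC can fetch it from there.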