2008-10-29 00:01 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3
2008-10-29 00:12 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-10-29 00:19 -!- pgquiles_(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3
2008-10-29 01:19 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-10-29 04:44 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-29 06:15 -!- FelipeS(~Felipe@lawn-128-61-120-139.lawn.gatech.edu) has joined #tux3
2008-10-29 06:15 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3
2008-10-29 06:21 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-10-29 07:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3
2008-10-29 07:50 -!- pgquiles(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3
2008-10-29 08:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-29 08:47 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-29 08:56 -!- RzM|Away(~razvan@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3
2008-10-29 10:27 -!- mingming(~mingming@32.97.110.51) has joined #tux3
2008-10-29 11:49 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3
2008-10-29 11:57 hirofumi, yes
2008-10-29 11:58 oh, great
2008-10-29 11:58 I was thinking about it last week
2008-10-29 11:58 next commit will add date handling, after that it's all atomic commit work
2008-10-29 11:59 I'm writing a post to clarify a few details at the moment
2008-10-29 11:59 it's a fun thing to think about, first new approach to the problem in 15 years
2008-10-29 12:00 the method needs a name
2008-10-29 12:00 isn't it atomic commit?
2008-10-29 12:00 a new kind of atomic commit
2008-10-29 12:00 the first kind used in filesystems was journalling
2008-10-29 12:01 then came logging and tree-based copy on write
2008-10-29 12:01 recursive copy on write
2008-10-29 12:01 ah, yes.
2008-10-29 12:01 this is non-recursive copy on write
2008-10-29 12:01 but that would be a lame name
2008-10-29 12:02 something to think about over the next couple weeks
2008-10-29 12:02 yes, atomic commit. it seems too generic
2008-10-29 12:03 btw, in rollup, do we need to write out modified btree-index?
2008-10-29 12:05 rollup writes out previously modified btree nodes
2008-10-29 12:05 i see
2008-10-29 12:05 and may at the same time modify more btree nodes, which will be written in a future rollup
2008-10-29 12:06 it will be recursive way to root?
2008-10-29 12:07 it eventually goes to the root, yes, but does not create a new root
2008-10-29 12:07 i see. how do we handle root?
2008-10-29 12:07 so there are two essential differences from recursive tree copy on write: 1) the updates are spread out in time, they don't happen on each leaf write 2) does not generate new trees
2008-10-29 12:08 or rather, does not generate multiple trees
2008-10-29 12:08 we have a few fixed locations for root
2008-10-29 12:08 and a sequence number
2008-10-29 12:08 I need to write that in a design note
2008-10-29 12:09 root is modified very rarely
2008-10-29 12:09 oh, i see. tux3 can merge btree-index modification in multiple phase?
2008-10-29 12:09 generally only when the inode table btree index needs an additional level
2008-10-29 12:10 it can
2008-10-29 12:10 um..
2008-10-29 12:11 the question of whether the current tree state is represented via promises in commit blocks or actual written out index blocks is orthogonal to the phase mechanism
2008-10-29 12:12 orthogonal?
2008-10-29 12:12 ACTION my english skill is too poor
2008-10-29 12:13 "does not affect"
2008-10-29 12:13 "one can be changed without affecting the other"
2008-10-29 12:14 i see.
2008-10-29 12:14 your english skill is fine, I didn't even notice you're not a native speaker
2008-10-29 12:14 oh, it's surprise to me
2008-10-29 12:15 thanks.
2008-10-29 12:16 i'm still thinking about rollup stage...
2008-10-29 12:16 in "Cache state reconstruction"
2008-10-29 12:16 section
2008-10-29 12:17 it says parent blocks of rolled up, will via promises recorded
2008-10-29 12:19 it means parent block will be copy-on-write, then it will be written to new location as new block?
2008-10-29 12:22 yes
2008-10-29 12:22 i see
2008-10-29 12:23 in fact, there is not a copy on write
2008-10-29 12:23 because the buffer is in cache
2008-10-29 12:23 ah
2008-10-29 12:23 the block buffer is simply assigned to a new location
2008-10-29 12:23 and the new location becomes a promise
2008-10-29 12:24 actual copy on write of buffers does happen, but it is for a different purpose: to prevent stalls in writing by userspace programs
2008-10-29 12:25 um..., but new one is block on stable image + previous promise
2008-10-29 12:25 yes
2008-10-29 12:26 if I think stable image is original, and new one can be called copy-on-write?
2008-10-29 12:26 except no copy is done
2008-10-29 12:26 so without a copy, it isn't copy on write
2008-10-29 12:27 a better term is redirect on write
2008-10-29 12:27 now that I think of it, copy on write is incorrect terminology for the algorithm used by btrfs
2008-10-29 12:27 well, let me think about that
2008-10-29 12:28 i see
2008-10-29 12:28 depends how they actually implement it
2008-10-29 12:29 physical remapping is done in buffer cache?
2008-10-29 12:30 yes
2008-10-29 12:30 during normal operation what we do is make the normal modification to the cached image of index block just as it is implemented now, and at the same time, write a promise to modify the physical block into a commit block
2008-10-29 12:30 rollup does not apply promises, because they are already applied
2008-10-29 12:30 only recover does
2008-10-29 12:30 only recovery does
2008-10-29 12:31 however, rollup optimize(?) promises?
2008-10-29 12:32 i mean it will rewrite/merge dirty index blocks
2008-10-29 12:32 rollup writes out the dirty index block, making the promises no longer necessary, so they can be discarded
2008-10-29 12:32 yes
2008-10-29 12:32 i see
2008-10-29 12:32 what I realized a couple days ago is that promises can be retired out of order
2008-10-29 12:33 and so we need a way to know which promises don't need to be applied any more, because the index block they refer to was already written out
2008-10-29 12:34 um...
2008-10-29 12:34 we don't know in advance what order the index blocks will be written out, because it depends on the pattern of filesystem activity
2008-10-29 12:34 it doesn't have dependency?
2008-10-29 12:34 dependency on what?
2008-10-29 12:35 e.g. previous phase may have parent directory of current phase?
2008-10-29 12:35 did you mean the word "directory"?
2008-10-29 12:36 directory entry
2008-10-29 12:36 a changed directory entry must be written out in the same phase as the changed inode table block
2008-10-29 12:37 that is a fule that guarantees atomicity
2008-10-29 12:37 a rule
2008-10-29 12:37 we don't actually analyze those dependencies
2008-10-29 12:37 um..
2008-10-29 12:38 but instead, just see what buffers the filesystem operation changes
2008-10-29 12:38 and add those changed buffers to the current phase
2008-10-29 12:38 previous phase has "foo", and next one has "foo/bar"?
2008-10-29 12:39 there can be a commit between creating foo and foo/bar, that is ok
2008-10-29 12:39 yes
2008-10-29 12:40 but reverse order of phase, bar is orphaned entry?
2008-10-29 12:40 but that can't happen because the phases cannot be completed out of order
2008-10-29 12:41 ah, maybe i misread "retired out of order"
2008-10-29 12:42 right, it's just the promises that can be retired out of order
2008-10-29 12:42 i see
2008-10-29 12:42 the order in which promises can be retired (that is, ignored on recovery) depends on the order in which we add dirty index blocks to the active phase
2008-10-29 12:43 that is a key point I need to mention: we do not normally add dirty index blocks to the active phase
2008-10-29 12:45 umm.. hard to understand yet for me unfortunately
2008-10-29 12:45 but I think you are the closest to understanding
2008-10-29 12:46 thanks, I hope
2008-10-29 12:46 so
2008-10-29 12:46 the reason we don't add dirty index blocks to the active phase is, that would defeat the optimization we do with the promises
2008-10-29 12:46 so instead, we only add them on split or rollup
2008-10-29 12:48 um.. what means "don't add"?
2008-10-29 12:48 we don't dirty those?
2008-10-29 12:48 each phase has a list of buffers that belong to it, and will be written to disk in that phase
2008-10-29 12:48 ah
2008-10-29 12:49 delay?
2008-10-29 12:49 which delay?
2008-10-29 12:49 delay to add dirty index blocks?
2008-10-29 12:50 there are dirty blocks not added to a phase, these are the blocks that need to be reconstructed from promises on recovery
2008-10-29 12:50 i see
2008-10-29 12:51 speaking of delay... a phase cannot begin to be written to disk until the next phase has started
2008-10-29 12:52 however, it can be by timeout?
2008-10-29 12:52 ah, yes
2008-10-29 12:52 I think a better trigger is, write queue on the underlying device nearly empty
2008-10-29 12:53 oh, i see
2008-10-29 12:53 so that when the device is not doing anything, at the start of an untar for example, the first phase will be very short
2008-10-29 12:54 sounds very good
2008-10-29 12:54 yes, good for throughput
2008-10-29 12:54 yes
2008-10-29 12:55 btw, in "Phase transition" section
2008-10-29 12:55 Starting a new phase requires incrementing the phase counter in the
2008-10-29 12:55 cached filesystem superblock and flushing all dirty inodes.
2008-10-29 12:55 in this section, "flushing all dirty inodes" means ->write_inode on linux
2008-10-29 12:55 ?
2008-10-29 12:56 but I meant also flushing any dirty blocks (pages in kernel) cached by the inode
2008-10-29 12:57 i see. actual write out...
2008-10-29 12:57 yes, I should have been more specific
2008-10-29 12:58 in starting a new phase, we need to flush dirty buffers?
2008-10-29 12:59 to start a new phase
2008-10-29 12:59 flush the buffers dirtied in the previous phase
2008-10-29 12:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-29 13:00 well, flush the _data_ buffers dirtied in the previous phase
2008-10-29 13:00 it's better to say, flush the dirty inode blocks
2008-10-29 13:00 ah, i see
2008-10-29 13:00 ordered-write mode?
2008-10-29 13:01 this is more strict than ordered-write
2008-10-29 13:01 because we write the data blocks to new locations that don't overwrite data in a previous phase
2008-10-29 13:01 it's like data=journal
2008-10-29 13:02 has the same effect, but without writing twice
2008-10-29 13:02 i see
2008-10-29 13:03 I thought data buffers may be in place update
2008-10-29 13:05 thanks. I believe my understanding became more good
2008-10-29 13:09 we might add in place update later as an additional mode
2008-10-29 13:09 like ordered data
2008-10-29 13:11 the advantage is not in terms of speed, because in both cases we write the new blocks only once, but in reducing fragmentation because the data does not have to be relocated
2008-10-29 13:11 yes
2008-10-29 13:12 for a solid state disk, the advantage is very little
2008-10-29 13:12 so we will always want our strict mode for ssd I think
2008-10-29 13:13 ah, yes. it may be important in future
2008-10-29 13:13 I have an ssd now :)
2008-10-29 13:13 my eee
2008-10-29 13:13 oh, too fast :)
2008-10-29 13:14 we don't have good fs for it yet :)
2008-10-29 13:16 btw, are you already thinking about locking rules?
2008-10-29 13:16 yes
2008-10-29 13:16 in some depth
2008-10-29 13:16 oh, great
2008-10-29 13:17 I'm ignore about it for now
2008-10-29 13:17 that's reasonable
2008-10-29 13:17 we can start with a simple lock
2008-10-29 13:18 e.g. per btree big lock?
2008-10-29 13:18 yes
2008-10-29 13:18 i see
2008-10-29 13:18 per inode, the easiest thing
2008-10-29 13:19 and one more for modifying the inode table
2008-10-29 13:19 and may be for bitmap?
2008-10-29 13:19 yes
2008-10-29 13:19 allocation lock
2008-10-29 13:19 i see
2008-10-29 13:20 ah
2008-10-29 13:20 in phase transition, we modify bitmap for commit blocks etc.?
2008-10-29 13:20 yes
2008-10-29 13:20 and bitmap change will be written to same phase?
2008-10-29 13:20 commit blocks are marked as allocated to prevent them from being allocated for other purposes
2008-10-29 13:21 good question
2008-10-29 13:21 it's probably most efficient to write it to the same phase
2008-10-29 13:22 well
2008-10-29 13:22 good question :)
2008-10-29 13:22 i see. I thought it may be in phase commit or related blocks
2008-10-29 13:23 there will be multiple commit blocks per phase
2008-10-29 13:23 so phase commit points those blocks?
2008-10-29 13:24 each commit block points to some number of flushed blocks
2008-10-29 13:24 as many as will fit in the commit
2008-10-29 13:25 yes
2008-10-29 13:25 and all the flushed blocks, plus all the commit blocks, have to be completely written before the commit block for the phase is written
2008-10-29 13:25 it might be better to call those multiple commit blocks, log blocks
2008-10-29 13:25 and reserve the term commit block for the phase commit block
2008-10-29 13:26 sounds good
2008-10-29 13:26 I think that the dirty bitmaps for the log blocks have to be in the same phase, yes
2008-10-29 13:26 i see. how about commit block?
2008-10-29 13:27 commit block also allocate new block?
2008-10-29 13:27 yes
2008-10-29 13:28 I don't see a clear reason why it has to be in the allocation map of its own phase, or in the next phase
2008-10-29 13:29 um..
2008-10-29 13:30 if crashed, we don't now free blocks until trace phases?
2008-10-29 13:30 now -> know
2008-10-29 13:30 we fall back to the last completed phase
2008-10-29 13:30 which means we found the commit block, and we know that it is allocated
2008-10-29 13:31 ah
2008-10-29 13:32 in recovery, we will mark those as allocated?
2008-10-29 13:33 yes
2008-10-29 13:33 part of reconstructing dirty metadata
2008-10-29 13:33 the phase commit block can be freed after the next phase completes
2008-10-29 13:34 one thing we could do in future, is allow more than one phase on disk
2008-10-29 13:34 yes
2008-10-29 13:34 this gives a very limited form of versioning
2008-10-29 13:35 at the expense of making allocation decisions more difficult
2008-10-29 13:35 it might be useful for something
2008-10-29 13:35 i see
2008-10-29 14:22 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-29 19:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-29 20:40 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-29 20:41 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-29 22:11 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
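
Illustrative sketch, not from the log: the discussion above says that index block changes are applied to the cache and logged as promises, that rollup writes a dirty index block out so its promises can be retired, that promises may be retired out of order, and that only recovery re-applies promises. The toy model below shows one way those pieces could fit together. Everything in it is an assumption made for illustration: the class name ToyIndex, the per-block "flushed" sequence number used to decide which promises are retired, and the use of plain sets as index blocks. It ignores phases, redirect on write, and on-disk layout, and it is not the tux3 implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    # All names here are hypothetical; this is a toy model, not tux3 code.
    cache: dict = field(default_factory=dict)    # block -> entries (in memory)
    stable: dict = field(default_factory=dict)   # block -> entries (on disk)
    log: list = field(default_factory=list)      # promises: (seq, block, entry)
    flushed: dict = field(default_factory=dict)  # block -> last seq covered by a rollup

    def modify(self, block, entry):
        """Change the cached index block and log a promise for the change."""
        self.cache.setdefault(block, set()).add(entry)
        self.log.append((len(self.log), block, entry))

    def rollup(self, block):
        """Write the dirty block out; its earlier promises become redundant."""
        self.stable[block] = set(self.cache[block])
        self.flushed[block] = len(self.log) - 1

    def pending(self):
        """Promises that recovery still has to apply."""
        return [p for p in self.log if p[0] > self.flushed.get(p[1], -1)]

    def recover(self):
        """Rebuild the cache from the stable image plus pending promises."""
        rebuilt = {b: set(entries) for b, entries in self.stable.items()}
        for _, block, entry in self.pending():
            rebuilt.setdefault(block, set()).add(entry)
        return rebuilt


fs = ToyIndex()
fs.modify(1, "a")
fs.modify(2, "b")
fs.modify(1, "c")
fs.rollup(2)        # block 2 goes out first, although block 1 was dirtied earlier
fs.modify(2, "d")
print(fs.pending())
print(fs.recover() == fs.cache)
```

In this model the promise for block 2 logged before the rollup is skipped on recovery, while the older promises for block 1 are still applied, which is the "retired out of order" situation described in the conversation.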