Hardlinks on Linux

jerch
User
Posts: 1669
Registered: Wednesday, 4 March 2009, 14:19

Because I needed it myself for a backup script using rsync, here is a class for finding hardlinks on Linux and passing that information along:

Code:

import os
import re
from copy import copy
from subprocess import Popen, PIPE
from json import dumps, loads
from collections import Counter
try:
    from itertools import imap  # Python 2
except ImportError:
    imap = map  # Python 3: the built-in map is already lazy
from operator import itemgetter, gt, lt, eq

NULL = open(os.devnull, 'w')
# At most one sign is allowed; a raw string keeps `\d` from being read as a
# string escape.
REX_FILTER = re.compile(r'([+-]?)(\d+)')
SIGNS = {'+': gt, '-': lt, '': eq}
CALL = ['find', None, '-xdev', '-type', 'f',
        '-links', None, '-printf', '%i: %n: %p\n']


class Hardlinks(object):
    """
    Searches `path` for files whose hardlink count matches `filter`. The
    filter is a string holding a number, optionally prefixed with a plus or
    minus sign (+ for greater than, - for less than).
    Use `load` to load values from another serialized Hardlinks object.
    """
    def __init__(self, path='/', filter='+1', load=None):
        if load:
            self.loads(load)
        else:
            self.call = copy(CALL)
            self.call[1] = path
            self.call[6] = filter
            self.inode2paths = {}
            self.path2inode = {}
            self._run()

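    def _run(self):
        # universal_newlines=True makes the pipe yield text lines, not bytes.
        p = Popen(self.call, stdout=PIPE, stderr=NULL,
                  universal_newlines=True)
        for line in p.stdout:
            # Split at most twice so a path containing ': ' stays intact, and
            # strip only the newline so trailing spaces in names survive.
            inode, links, path = line.rstrip('\n').split(': ', 2)
            inode = int(inode)
            links = int(links)
            path = os.path.abspath(path)
            self.inode2paths.setdefault(inode, (links, []))[1].append(path)
            self.path2inode[path] = inode
        p.wait()  # reap the finished find process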

    @property
    def path(self):
        """
        Returns the used `path` argument.
        """
        return self.call[1]

    @property
    def filter(self):
        """
        Returns the used `filter` argument.
        """
        return self.call[6]

    @property
    def count_hardlinks(self):
        """
        Number of inodes matching the filter in path.
        """
        return len(self.inode2paths)

    @property
    def count_files(self):
        """
        Number of hardlinked files matching the filter in path.
        """
        return len(self.path2inode)

    @property
    def distribution(self):
        """
        Distribution of number of hardlinks in path.
        """
        return Counter(imap(itemgetter(0), self.inode2paths.values()))

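    def test_path(self, path):
        """
        Returns a list of hardlinked paths of `path` (empty if unknown).
        """
        path = os.path.abspath(path)
        try:
            # The stored value is (links, paths); return only the path list
            # so both branches return the same type.
            return self.inode2paths[self.path2inode[path]][1]
        except KeyError:
            return []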

    def dumps(self):
        """
        Serializes the internal data for later usage.
        """
        return dumps((self.call, self.inode2paths, self.path2inode))

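    def loads(self, s):
        """
        Loads internal data from `s`.
        """
        self.call, inode2paths, self.path2inode = loads(s)
        # JSON object keys are always strings, so restore the integer inodes
        # (and the (links, paths) tuples) that `_run` produces.
        self.inode2paths = dict(
            (int(inode), (links, paths))
            for inode, (links, paths) in inode2paths.items())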

    def get_paths(self, filter='+1'):
        """
        Yields (number of hardlinks, list of paths) pairs for `filter`.
        This uses the internal cached data to avoid rescanning the
        filesystem.
        Note: If the number of hardlinks does not match the length of
        the path list, at least one link lies outside the search path.
        """
        sign, num = re.match(REX_FILTER, filter).groups()
        num = int(num)
        compare = SIGNS[sign]
        for count, paths in self.inode2paths.values():
            if compare(count, num):
                yield count, paths
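
A side note on `dumps`/`loads`: JSON stores object keys as strings and has no tuple type, so a mapping keyed by integer inodes does not come back unchanged from a round trip. The following is only a minimal sketch of that behaviour and of converting the keys back. The inode number, the paths and the helper name `restore_inodes` are made up for illustration.

```python
from json import dumps, loads

# Made-up sample data in the same shape as `inode2paths`:
# inode -> (link count, list of paths)
inode2paths = {1234: (2, ['/tmp/a', '/tmp/b'])}

# After a JSON round trip the inode keys are strings and the tuples are lists.
restored = loads(dumps(inode2paths))


def restore_inodes(mapping):
    """Convert string keys back to int inodes and values back to tuples."""
    return {int(inode): (links, paths)
            for inode, (links, paths) in mapping.items()}


fixed = restore_inodes(restored)
```

Without such a conversion, a lookup with an integer inode in the reloaded mapping raises a `KeyError`.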
BlackJack

@jerch: Just out of curiosity: is there a reason for using ``find`` instead of `os.walk()` and `os.stat()`?
jerch

@BlackJack:
I first tried it with `os.walk`, but that was noticeably slower:

Code:

#> time python hardlinks_walk.py
/ +1 68 240
Counter({4: 48, 2: 14, 3: 5, 5: 1})

real    0m3.601s
user    0m2.489s
sys     0m1.110s
Compared with ``find``:

Code:

#> time python hardlinks_find.py
/ +1 68 240
Counter({4: 48, 2: 14, 3: 5, 5: 1})

real    0m0.880s
user    0m0.364s
sys     0m0.517s
Tested several times on my SSD (Xubuntu with about 25 GB and ~300k files on the root filesystem).

On a conventional hard disk the difference would probably be barely noticeable, since the disk's latency has a far greater influence.
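
The `os.walk` variant that was timed above is not shown in the thread. For comparison, here is a minimal Python 3 sketch of what a scan with `os.walk()` and `os.lstat()` might look like. It is not the original `hardlinks_walk.py`; the names `scan_hardlinks`, `_device` and `min_links` are assumptions, and the device check is only a rough stand-in for `find -xdev`.

```python
import os
import stat


def _device(path):
    """Return the device number of `path`, or None if it cannot be examined."""
    try:
        return os.lstat(path).st_dev
    except OSError:
        return None


def scan_hardlinks(root, min_links=2):
    """Map inode -> (link count, [paths]) for regular files below `root`."""
    root_dev = os.lstat(root).st_dev
    inode2paths = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories on other filesystems, roughly like -xdev.
        dirnames[:] = [d for d in dirnames
                       if _device(os.path.join(dirpath, d)) == root_dev]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is not accessible
            # Regular files only (-type f) with enough links (-links +1).
            if stat.S_ISREG(st.st_mode) and st.st_nlink >= min_links:
                entry = inode2paths.setdefault(st.st_ino, (st.st_nlink, []))
                entry[1].append(path)
    return inode2paths
```

Every file costs one `lstat` call from Python here, which is presumably where much of the extra user time in the measurement above comes from.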