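Python Syntax Practice (Solving Assorted Small Problems)

Let's practice some Python syntax by building a few small utilities.

1. Aggregate the IPs in an Nginx access log and sort by hit count, descending

Simulated log IPs (in practice you would just extract them from the raw log first):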

$ cat access.log
10.10.0.1
10.10.0.2
10.10.0.1
10.10.0.2
10.10.0.3
10.10.0.1
10.10.0.2
10.10.0.3
10.10.0.4
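
In a real log each line carries far more than the address. Here is a minimal sketch of the extraction step, assuming Nginx's default `combined` log format, where the client address is the first whitespace-separated field. The sample lines and the helper name `extract_ip` are made up for illustration.

```python
def extract_ip(line):
    """Return the first whitespace-separated field of a log line, or None if blank."""
    fields = line.split()
    return fields[0] if fields else None

# Hypothetical raw lines in the default "combined" format (not from a real server).
raw_lines = [
    '10.10.0.1 - - [01/Jan/2024:00:00:01 +0800] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    '10.10.0.2 - - [01/Jan/2024:00:00:02 +0800] "GET /page?1 HTTP/1.1" 200 1024 "-" "curl/8.0"',
    '',  # blank lines are skipped
]

ips = [ip for ip in map(extract_ip, raw_lines) if ip is not None]
print(ips)
```

If your `log_format` puts the address somewhere else, the field index would need to change accordingly.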

The code:

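count = dict()

# Build a dict that maps each IP to its hit count.
with open('access.log') as f:
    for line in f:
        ip = line.strip()  # drop the trailing newline so identical IPs share one key
        if not ip:
            continue       # skip blank lines
        if ip not in count:
            count[ip] = 0
        count[ip] += 1

# Unpack each (ip, hits) pair and sort by hits, descending, with sorted().
for ip, hits in sorted(count.items(), key=lambda x: x[1], reverse=True):
    print('%s => %d' % (ip, hits))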

The result:

10.10.0.1 => 3
10.10.0.2 => 3
10.10.0.3 => 2
10.10.0.4 => 1
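
The same tally can also be sketched with `collections.Counter` from the standard library, whose `most_common()` method returns the pairs already sorted by count. The list `sample` below stands in for the lines of access.log so that the snippet runs on its own; with a real file you would iterate over `open('access.log')` instead.

```python
from collections import Counter

# Stand-in for the nine lines of the sample access.log.
sample = [
    '10.10.0.1', '10.10.0.2', '10.10.0.1',
    '10.10.0.2', '10.10.0.3', '10.10.0.1',
    '10.10.0.2', '10.10.0.3', '10.10.0.4',
]

# Count every non-blank line after stripping surrounding whitespace.
hits = Counter(line.strip() for line in sample if line.strip())

# most_common() sorts by count, descending; ties keep first-seen order.
ranked = hits.most_common()
for ip, n in ranked:
    print('%s => %d' % (ip, n))
```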

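2. Find the URLs most in need of optimization

The task is to find the URLs that most need optimizing. That means we cannot just pick the URL with the longest response time, nor just the one with the most hits: a URL may be slow but rarely requested, or requested very often but fast. It is actually quite an interesting problem.

Here is some simulated data (page, response time):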

$ cat access.log
"/page?1","0.10"
"/page?2","0.20"
"/page?1","0.10"
"/page?2","0.20"
"/page?3","0.30"
"/page?1","0.10"
"/page?2","0.20"
"/page?3","0.30"
"/page?4","0.40"

The first version of the code:

def Log(file):
    # Working state.
    stats = dict()   # url -> (hit count, total response time)
    result = dict()  # url -> weight
    total = 0        # total number of requests

    # Use the URL as the key and a (hit count, total time) tuple as the value, e.g. {'page': (3, 1)}.
    with open(file) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            tmp = line.split(',')
            url = tmp[0].replace('"', '')
            time = float(tmp[1].replace('"', ''))
            if url not in stats:
                stats[url] = (0, 0)
            stats[url] = (stats[url][0] + 1, stats[url][1] + time)
            total += 1

    # Unpack each entry and compute its weight: average time multiplied by the URL's share of all requests.
    for url, (count, res_time) in stats.items():
        time = (res_time / count) * (count / total * 100)
        result[url] = time

    # Unpack again and sort by weight, descending, with sorted().
    for url, sum_time in sorted(result.items(), key=lambda x: x[1], reverse=True):
        print('{} => {}'.format(url, sum_time))

if __name__ == '__main__':
    Log('access.log')


The output:

/page?2 => 6.666666666666667
/page?3 => 6.666666666666666
/page?4 => 4.444444444444445
/page?1 => 3.3333333333333335
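
The script strips the quotes by hand with `split` and `replace`. As an alternative sketch, the standard `csv` module can parse quoted, comma-separated lines like these directly. The string `sample` below is a stand-in for the file contents so that the snippet runs on its own.

```python
import csv
import io

# Stand-in for a few lines of the sample access.log.
sample = '"/page?1","0.10"\n"/page?2","0.20"\n"/page?3","0.30"\n'

# csv.reader removes the surrounding quotes and yields one list of fields per line.
rows = [(url, float(seconds)) for url, seconds in csv.reader(io.StringIO(sample))]
print(rows)
```

With a real file, `csv.reader(open('access.log', newline=''))` would take the place of the `io.StringIO` wrapper.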

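The whole script uses only basic syntax plus the built-in sorted function. The part worth discussing is the weight. (res_time / count) gives a URL's average response time, and (count / total * 100) gives its hits as a percentage of all requests. Multiplying the two gives the URL's weight, and the rest of the script simply sorts by that weight.

The weight therefore acts as a combined score of response time and hit count, rather than either one alone. Sorting by it in descending order puts the URLs that are both slow and frequently requested at the top. Better still, we can tune how much response time and request frequency each contribute, as in this formula: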
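
One way to see what this weight measures is to recompute it from the aggregated sample data. The sketch below assumes the per-URL totals implied by the nine sample lines; the dict `stats` and the other names are illustrative. It also checks that, with both factors at power 1, the hit count cancels out, so the weight is simply the URL's total response time divided by the total number of requests, times 100.

```python
import math

# Per-URL (hits, total seconds) implied by the nine sample lines.
stats = {
    '/page?1': (3, 0.30),
    '/page?2': (3, 0.60),
    '/page?3': (2, 0.60),
    '/page?4': (1, 0.40),
}
total = sum(hits for hits, _ in stats.values())  # 9 requests overall

weights = {}
for url, (hits, seconds) in stats.items():
    avg_time = seconds / hits        # average response time
    share = hits / total * 100       # percentage of all requests
    weights[url] = avg_time * share
    # hits cancels: (seconds / hits) * (hits / total * 100) == seconds / total * 100
    same = math.isclose(weights[url], seconds / total * 100)
    print(url, round(weights[url], 2), same)
```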

time = (res_time / count) ** 1 * (count / total * 100) ** 1

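That is, we put an exponent on the average response time and another on the request share. If you care more about slow URLs, raise the exponent on the average response time: slow URLs then get a relatively higher weight and move toward the top. The same works in the other direction for request frequency. (An exponent is used because a higher power widens the relative gap between a long response time and a short one, so the longer one gains weight.)

Here I use (res_time / count) ** 2 and look at the new ordering: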

/page?3 => 1.9999999999999998
/page?4 => 1.7777777777777781
/page?2 => 1.3333333333333337
/page?1 => 0.3333333333333334

The order has changed because response time now counts for more: /page?3, with a 0.3s response time and 2 hits, moves to the top. /page?4 has the longest response time but only 1 hit, so its overall weight is still lower than that of /page?3. With (res_time / count) ** 3, /page?4 does come first (about 0.71 against about 0.60 for /page?3).
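
The effect of the exponent can be checked with a small sketch. It assumes the per-URL totals implied by the sample data; the function name `rank` and the dict `stats` are illustrative, not part of the original script.

```python
# Per-URL (hits, total seconds) implied by the nine sample lines.
stats = {
    '/page?1': (3, 0.30),
    '/page?2': (3, 0.60),
    '/page?3': (2, 0.60),
    '/page?4': (1, 0.40),
}
total = 9  # total number of requests

def rank(time_weight=1, count_weight=1):
    """Return the URLs ordered by weight, highest first."""
    weights = {
        url: (seconds / hits) ** time_weight * (hits / total * 100) ** count_weight
        for url, (hits, seconds) in stats.items()
    }
    ordered = sorted(weights.items(), key=lambda x: x[1], reverse=True)
    return [url for url, _ in ordered]

# Compare the ordering for a squared and a cubed average response time.
for power in (2, 3):
    print(power, rank(time_weight=power))
```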

This example is also a good chance to practice Python class inheritance. I reworked the script above into two classes, which makes it a bit more flexible.

class LogResult(object):
    def __init__(self, file):
        # Per-instance state; class-level dicts would be shared by every instance.
        self.count = dict()   # url -> (hit count, total response time)
        self.result = dict()  # url -> weight
        self.total = 0        # total number of requests

        with open(file) as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                tmp = line.split(',')
                url = tmp[0].replace('"', '')
                time = float(tmp[1].replace('"', ''))
                if url not in self.count:
                    self.count[url] = (0, 0)
                self.count[url] = (self.count[url][0] + 1, self.count[url][1] + time)
                self.total += 1

class LogHandler(LogResult):
    def __init__(self, file, time_weight=1, count_weight=1, reverse=True, row=None):
        # Run the parent initializer first so that self.count and self.total are filled in.
        super(LogHandler, self).__init__(file)
        self.sortvalues = list()

        for url, (count, res_time) in self.count.items():
            time = (res_time / count) ** time_weight * (count / self.total * 100) ** count_weight
            self.result[url] = time

        # Only an int row limits the output; anything else keeps every URL.
        limit = row if isinstance(row, int) else None
        for url, sum_time in sorted(self.result.items(), key=lambda x: x[1], reverse=reverse)[:limit]:
            self.sortvalues.append('{} => {}'.format(url, sum_time))

if __name__ == '__main__':
    Log = LogHandler(file='access.log', time_weight=1, count_weight=1, row=3)
    for i in Log.sortvalues:
        print(i)

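Here LogHandler inherits from LogResult, so it gets the parent class's methods and attributes.

Also, when a subclass redefines a method, it overrides the parent's method of the same name. Sometimes we still want the parent's behavior as well, and then we need to call the parent's method, which super makes possible.

LogHandler defines __init__, the same method name as in its parent, so the parent's __init__ no longer runs automatically and the data it sets up (the parsed counts and the total) would never be populated on the instance. I call super inside the subclass to run the parent's initializer, which solves the problem. Alternatively, give the subclass method a name other than __init__; the inherited __init__ then runs as usual and the problem does not arise.

A quick word on this line: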
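
The difference can be seen in a minimal sketch. The class names `Base`, `WithoutSuper` and `WithSuper` are hypothetical and exist only for this illustration; `super().__init__(...)` is the Python 3 shorthand for the `super(LogHandler, self).__init__(...)` form used in the script.

```python
class Base:
    def __init__(self, name):
        self.name = name            # attribute set up by the parent initializer

class WithoutSuper(Base):
    def __init__(self, name):
        # Overrides Base.__init__ without calling it, so self.name is never set.
        self.tag = 'child'

class WithSuper(Base):
    def __init__(self, name):
        super().__init__(name)      # run the parent initializer first
        self.tag = 'child'

print(hasattr(WithoutSuper('a'), 'name'))  # False
print(WithSuper('a').name)                 # a
```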

def __init__(self, file, time_weight=1, count_weight=1, reverse=True, row=None):

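file: the file to process.

time_weight: the exponent applied to the average response time.

count_weight: the exponent applied to the request share.

reverse: whether to sort in descending (True) or ascending (False) order.

row: the number of rows to display; None shows all of them.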
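
How reverse and row shape the output can be sketched on their own, apart from the log parsing. The weights below are made-up illustrative values and the function name `top` is hypothetical. The sketch relies on the fact that slicing with `[:None]` keeps every element, so a single code path can cover both the limited and the unlimited case.

```python
# Illustrative weights only; not computed from the sample log.
weights = {'/page?1': 3.3, '/page?2': 6.7, '/page?3': 6.6, '/page?4': 4.4}

def top(weights, reverse=True, row=None):
    """Sort (url, weight) pairs by weight and keep the first `row` of them."""
    ordered = sorted(weights.items(), key=lambda x: x[1], reverse=reverse)
    return ordered[:row]  # row=None slices to the end, i.e. keeps everything

print(top(weights, row=2))                 # the two heaviest URLs
print(top(weights, reverse=False, row=1))  # the single lightest URL
print(len(top(weights)))                   # no limit: all four entries
```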
