RED Queue Implementation in the Linux Kernel

For the tc commands that configure a RED queue, see: RED queue tc configuration

1 RED enqueue

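First, the average queue length qavg is computed; see the description of red_calc_qavg below. Because this happens while the idle state is still set, an idle queue takes the idle-time branch of that calculation. Afterwards, if the queue was idle, it leaves the idle state and the idle-start timestamp is cleared.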

static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
{
    struct red_sched_data *q = qdisc_priv(sch);
    struct Qdisc *child = q->qdisc;
    int ret;

    q->vars.qavg = red_calc_qavg(&q->parms,
                     &q->vars, child->qstats.backlog);

    if (red_is_idling(&q->vars))
        red_end_of_idle_period(&q->vars);

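The function red_action decides which action to take based on the configured parameters, the average queue length, and related values. a) RED_DONT_MARK: the packet is neither ECN-marked nor dropped. b) RED_PROB_MARK: if ECN is enabled for RED and the CE codepoint is set on the packet successfully, marking is complete and the packet goes on to be enqueued; otherwise, execution jumps to the congestion_drop label and the packet is dropped. c) RED_HARD_MARK: if the harddrop option was given in the RED configuration, no attempt is made to set the ECN CE codepoint and the packet is dropped directly; otherwise, handling is the same as for RED_PROB_MARK, except that the forced_* counters are updated.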

    switch (red_action(&q->parms, &q->vars, q->vars.qavg)) {
    case RED_DONT_MARK:
        break;

    case RED_PROB_MARK:
        qdisc_qstats_overlimit(sch);
        if (!red_use_ecn(q) || !INET_ECN_set_ce(skb)) {
            q->stats.prob_drop++;
            goto congestion_drop;
        }

        q->stats.prob_mark++;
        break;

    case RED_HARD_MARK:
        qdisc_qstats_overlimit(sch);
        if (red_use_harddrop(q) || !red_use_ecn(q) ||
            !INET_ECN_set_ce(skb)) {
            q->stats.forced_drop++;
            goto congestion_drop;
        }

        q->stats.forced_mark++;
        break;
    }

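Packets that were not dropped are enqueued to the child qdisc here. On success, the backlog statistic is increased and the queue length is incremented by one.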

    ret = qdisc_enqueue(skb, child, to_free);
    if (likely(ret == NET_XMIT_SUCCESS)) {
        qdisc_qstats_backlog_inc(sch, skb);
        sch->q.qlen++;
    } else if (net_xmit_drop_count(ret)) {
        q->stats.pdrop++;
        qdisc_qstats_drop(sch);
    }
    return ret;

congestion_drop:
    qdisc_drop(skb, sch, to_free);
    return NET_XMIT_CN;
}
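The mark-or-drop decision in red_enqueue can be condensed into a small user-space sketch. This is not kernel code: the enum names, the function enqueue_verdict, and the boolean parameters are hypothetical stand-ins for the results of red_action, red_use_ecn, red_use_harddrop and INET_ECN_set_ce.

```c
#include <stdbool.h>

/* Hypothetical mirror of the three results of red_action(). */
enum red_action_model { DONT_MARK, PROB_MARK, HARD_MARK };

/* Hypothetical outcome of the enqueue decision. */
enum red_verdict { VERDICT_ENQUEUE, VERDICT_ENQUEUE_MARKED, VERDICT_DROP };

/*
 * use_ecn / use_harddrop model the TC_RED_ECN / TC_RED_HARDDROP flags;
 * set_ce_ok models whether setting the CE codepoint succeeded.
 */
static enum red_verdict enqueue_verdict(enum red_action_model action,
                                        bool use_ecn, bool use_harddrop,
                                        bool set_ce_ok)
{
    switch (action) {
    case DONT_MARK:
        return VERDICT_ENQUEUE;
    case PROB_MARK:
        /* Without ECN, or if marking fails, the packet is dropped. */
        if (!use_ecn || !set_ce_ok)
            return VERDICT_DROP;
        return VERDICT_ENQUEUE_MARKED;
    case HARD_MARK:
        /* harddrop skips the ECN attempt entirely. */
        if (use_harddrop || !use_ecn || !set_ce_ok)
            return VERDICT_DROP;
        return VERDICT_ENQUEUE_MARKED;
    }
    return VERDICT_DROP; /* not reached for valid input */
}
```

For example, a RED_HARD_MARK result with ECN enabled and harddrop unset still lets an ECN-capable packet through with the CE mark, while the same packet is dropped once harddrop is set.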

2 Average queue length calculation

The function red_calc_qavg below computes the average queue length (qavg). It takes one of two branches depending on whether the queue is currently idle.

static inline unsigned long red_calc_qavg(const struct red_parms *p,
                      const struct red_vars *v, unsigned int backlog)
{
    if (!red_is_idling(v))
        return red_calc_qavg_no_idle_time(p, v, backlog);
    else
        return red_calc_qavg_from_idle_time(p, v);
}

For a queue that is not idle, red_calc_qavg_no_idle_time computes the average queue length using the following formula:

$$avg \leftarrow (1 - w_{q})\,avg + w_{q}\,q = avg + w_{q} \cdot q - avg \cdot w_{q}$$

Next, dividing both sides by w_q gives:

$$\frac{avg}{w_{q}} \leftarrow \frac{avg}{w_{q}} + q - avg$$

From the tc configuration procedure we know that wq = 1.0/(1<<Wlog), so dividing by wq is a left shift by Wlog. In addition, the kernel variable qavg equals (avg << Wlog); for details see: RED queue tc configuration. Substituting Wlog into the equation above gives:

$$avg \ll Wlog \leftarrow (avg \ll Wlog) + q - avg$$

Replacing avg in this equation with qavg (that is, avg = qavg >> Wlog) then gives:

$$qavg \leftarrow qavg + q - (qavg \gg Wlog)$$

The function red_calc_qavg_no_idle_time uses this last formula to compute the average queue length in the non-idle state.

static inline unsigned long red_calc_qavg_no_idle_time(const struct red_parms *p,
                               const struct red_vars *v, unsigned int backlog)
{
    /*
     * NOTE: v->qavg is fixed point number with point at Wlog.
     * The formula below is equivalent to floating point version:
     *
     *  qavg = qavg*(1-W) + backlog*W;
     *
     * --ANK (980924)
     */
    return v->qavg + (backlog - (v->qavg >> p->Wlog));
}
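To make the fixed-point formula concrete, here is a minimal user-space sketch that runs one update step in both representations. The function names qavg_step and avg_step_float are hypothetical; only the arithmetic is taken from red_calc_qavg_no_idle_time and from the floating-point formula in its comment.

```c
/*
 * One fixed-point EWMA step. qavg holds (avg << wlog), so the weight
 * W = 1 / 2^wlog never has to be represented as a fraction.
 * The subtraction may wrap around in unsigned arithmetic when the
 * backlog is smaller than the average; the addition wraps it back.
 */
static unsigned long qavg_step(unsigned long qavg, unsigned int backlog,
                               unsigned int wlog)
{
    return qavg + (backlog - (qavg >> wlog));
}

/* Floating-point reference: avg * (1 - W) + backlog * W. */
static double avg_step_float(double avg, double backlog, unsigned int wlog)
{
    double w = 1.0 / (double)(1UL << wlog);

    return avg * (1.0 - w) + backlog * w;
}
```

With wlog = 2 and a constant backlog of 100, the fixed-point value goes 0, 100, 175, which corresponds to averages of 0, 25 and 43.75 after dividing by 4. The floating-point reference yields the same averages.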

3 qavg calculation in the idle state

The function red_calc_qavg_from_idle_time below computes avg when a packet arrives at a queue that has been idle. The RED algorithm gives the following formula:

$$avg \leftarrow (1 - w_{q})^{m}\,avg$$

where m is obtained from:

$$m \leftarrow \frac{time - q\_time}{s}$$

The Stab array holds the precomputed base-2 logarithm of (1-W)^m, stored as a right-shift count. The idle time shifted right by Scell_log bits is therefore used as the index into the array, and qavg is shifted right by the value found there.

static inline unsigned long red_calc_qavg_from_idle_time(const struct red_parms *p,
                             const struct red_vars *v)
{
    s64 delta = ktime_us_delta(ktime_get(), v->qidlestart);
    long us_idle = min_t(s64, delta, p->Scell_max);
    int shift;
    /*
     * The problem: ideally, average length queue recalculation should
     * be done over constant clock intervals. This is too expensive, so
     * that the calculation is driven by outgoing packets.
     * When the queue is idle we have to model this clock by hand.
     *
     * SF+VJ proposed to "generate":
     *
     *  m = idletime / (average_pkt_size / bandwidth)
     *
     * dummy packets as a burst after idle time, i.e.
     *
     *  v->qavg *= (1-W)^m
     *
     * This is an apparently overcomplicated solution (f.e. we have to
     * precompute a table to make this calculation in reasonable time)
     * I believe that a simpler model may be used here, but it is field for experiments.
     */
    shift = p->Stab[(us_idle >> p->Scell_log) & RED_STAB_MASK];

    if (shift)
        return v->qavg >> shift;

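If shift is zero, the idle time is shorter than one cell, where a cell lasts (1 << p->Scell_log) microseconds. In this case the idle time is multiplied by qavg and shifted right by Scell_log again. If the result is less than half of qavg, the function returns qavg minus that result; otherwise it returns half of qavg.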

    else {
        /* Approximate initial part of exponent with linear function:
         *
         *  (1-W)^m ~= 1-mW + ...
         *
         * Seems, it is the best solution to problem of too coarse exponent tabulation.
         */
        us_idle = (v->qavg * (u64)us_idle) >> p->Scell_log;

        if (us_idle < (v->qavg >> 1))
            return v->qavg - us_idle;
        else
            return v->qavg >> 1;
    }
}
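The two branches can be tried out with a user-space sketch. The struct idle_parms and the function qavg_after_idle are hypothetical names; the idle time is passed in as a parameter instead of being read from the clock, and the table contents in the usage note are made-up values rather than what tc would compute.

```c
#include <stdint.h>

#define STAB_SIZE 256
#define STAB_MASK (STAB_SIZE - 1)

/* Hypothetical user-space subset of struct red_parms. */
struct idle_parms {
    uint8_t  scell_log;        /* log2 of the cell duration in microseconds */
    uint32_t scell_max;        /* cap on the idle time in microseconds */
    uint8_t  stab[STAB_SIZE];  /* precomputed right-shift count per cell */
};

static unsigned long qavg_after_idle(const struct idle_parms *p,
                                     unsigned long qavg, int64_t idle_us)
{
    /* Clamp the idle time, as min_t() does in the kernel function. */
    int64_t us_idle = idle_us < (int64_t)p->scell_max
                          ? idle_us : (int64_t)p->scell_max;
    int shift = p->stab[(us_idle >> p->scell_log) & STAB_MASK];

    if (shift)
        return qavg >> shift;  /* table lookup: qavg * (1-W)^m */

    /* Shorter than one cell: linear approximation, capped at qavg / 2. */
    uint64_t dec = ((uint64_t)qavg * (uint64_t)us_idle) >> p->scell_log;

    if (dec < (qavg >> 1))
        return qavg - (unsigned long)dec;
    return qavg >> 1;
}
```

With scell_log = 8 and a table entry of 3 for cell 2, an idle time of 600 us reduces a qavg of 1024 to 128. An idle time of 10 us falls into cell 0, where the shift is 0, so the linear branch returns 1000 - ((1000 * 10) >> 8) = 961 for a qavg of 1000.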

4 RED actions

First, consider the random number used by RED, which is computed by red_random. Here max_P_reciprocal = 1 / (max_P / (qth_max - qth_min)) = (qth_max - qth_min) / max_P, where max_P is the configured maximum marking probability scaled up by 2^32, i.e. max_P = probability * pow(2, 32). The resulting random value is qR = (prandom_u32() / max_P) * (qth_max - qth_min).

If max_P is scaled back down by 2^32, then qR = ([0,1] / original_max_P) * (qth_max - qth_min). Since original_max_P = (qth_max - qth_min) / 2^Plog, qR lies in the range [0, 2^Plog].

static inline u32 red_random(const struct red_parms *p)
{
    return reciprocal_divide(prandom_u32(), p->max_P_reciprocal);
}
static inline void red_set_parms(struct red_parms *p, u32 qth_min, u32 qth_max, u8 Wlog, ...)
{             
    int delta = qth_max - qth_min;

    max_p_delta = max_P / delta;
    max_p_delta = max(max_p_delta, 1U);
    p->max_P_reciprocal  = reciprocal_value(max_p_delta);
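The scaling of qR can be checked with a short user-space sketch. The function red_random_model is a hypothetical name, the random input is supplied by the caller, and a plain integer division stands in for the reciprocal_value / reciprocal_divide pair, which the kernel uses to avoid a division on the fast path. The sketch assumes delta is non-zero.

```c
#include <stdint.h>

/*
 * rnd   : a 32-bit random number (prandom_u32() in the kernel)
 * max_P : maximum marking probability scaled by 2^32
 * delta : qth_max - qth_min, assumed non-zero
 */
static uint32_t red_random_model(uint32_t rnd, uint32_t max_P, uint32_t delta)
{
    uint32_t max_p_delta = max_P / delta;

    if (max_p_delta < 1)
        max_p_delta = 1;       /* same clamp as max(max_p_delta, 1U) */

    return rnd / max_p_delta;  /* stands in for reciprocal_divide() */
}
```

For a probability of 0.02 (max_P = 85899345) and delta = 20000, the largest possible input 0xFFFFFFFF yields qR = 1000225, which is close to delta / probability = 1000000. The difference comes from the integer truncation of max_P / delta.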

The function red_mark_probability evaluates the packet marking probability. In the RED algorithm, the test for leaving a packet unmarked is (qcount < R/Pb), i.e. Pb * qcount < R, where Pb = original_max_P * (qavg - qth_min) / (qth_max - qth_min). This gives the following inequality:

$$\frac{original\_max\_P \cdot (qavg - qth\_min)}{qth\_max - qth\_min} \cdot qcount < R$$

From qR = (R / original_max_P) * (qth_max - qth_min) we get R = (qR / (qth_max - qth_min)) * original_max_P. Substituting R into the inequality above gives:

$$\frac{original\_max\_P \cdot (qavg - qth\_min)}{qth\_max - qth\_min} \cdot qcount < \frac{qR}{qth\_max - qth\_min} \cdot original\_max\_P$$

The common factor original_max_P / (qth_max - qth_min) cancels on both sides. Because qavg and qth_min are both values that have been shifted left by Wlog, they are then shifted back, and the simplified inequality is:

$$((qavg - qth\_min) \gg Wlog) \cdot qcount < qR$$

The function red_mark_probability checks this inequality: if it holds, the packet is not marked; if it does not hold, the packet is marked.

static inline int red_mark_probability(const struct red_parms *p,
                       const struct red_vars *v, unsigned long qavg)
{
    /* The formula used below causes questions.

       OK. qR is random number in the interval
        (0..1/max_P)*(qth_max-qth_min) i.e. 0..(2^Plog). If we used floating point
       arithmetics, it would be: (2^Plog)*rnd_num, where rnd_num is less 1.

       Taking into account, that qavg have fixed point at Wlog, two lines
       below have the following floating point equivalent:

       max_P*(qavg - qth_min)/(qth_max-qth_min) < rnd/qcount
       Any questions? --ANK (980924)
     */
    return !(((qavg - p->qth_min) >> p->Wlog) * v->qcount < v->qR);
}

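Finally, consider the RED action function red_action. When the average queue length qavg is below the minimum threshold qth_min, no marking is performed; when qavg is greater than or equal to the maximum threshold qth_max, every received packet is marked; when qavg lies between the two, marking is decided by the RED random algorithm.

Because qavg, qth_min and qth_max are all values that have been shifted left by Wlog, they can be compared directly.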

static inline int red_cmp_thresh(const struct red_parms *p, unsigned long qavg)
{
    if (qavg < p->qth_min)
        return RED_BELOW_MIN_THRESH;
    else if (qavg >= p->qth_max)
        return RED_ABOVE_MAX_TRESH;
    else
        return RED_BETWEEN_TRESH;
}

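A non-zero return value from red_mark_probability means the packet is marked; in that case qcount is reset to 0 and a new random value qR is drawn. In addition, qR is also refreshed when qcount was -1, that is, for the first packet after qavg enters the range between the two thresholds.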

static inline int red_action(const struct red_parms *p,
                 struct red_vars *v, unsigned long qavg)
{
    switch (red_cmp_thresh(p, qavg)) {
        case RED_BELOW_MIN_THRESH:
            v->qcount = -1;
            return RED_DONT_MARK;

        case RED_BETWEEN_TRESH:
            if (++v->qcount) {
                if (red_mark_probability(p, v, qavg)) {
                    v->qcount = 0;
                    v->qR = red_random(p);
                    return RED_PROB_MARK;
                }
            } else
                v->qR = red_random(p);

            return RED_DONT_MARK;

        case RED_ABOVE_MAX_TRESH:
            v->qcount = -1;
            return RED_HARD_MARK;
    }

    BUG();
    return RED_DONT_MARK;
}
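The interplay of qcount and qR is easier to follow in a deterministic user-space sketch. The struct red_model and the function red_action_model are hypothetical; the next random value is passed in by the caller instead of coming from red_random, so that a sequence of calls can be traced by hand.

```c
#include <stdint.h>

enum { DONT_MARK, PROB_MARK, HARD_MARK };

/* Hypothetical model state; thresholds are already shifted left by wlog. */
struct red_model {
    unsigned long qth_min;
    unsigned long qth_max;
    unsigned int  wlog;
    int           qcount;  /* packets since the last mark, -1 = inactive */
    uint32_t      qR;      /* current random threshold */
};

static int red_action_model(struct red_model *m, unsigned long qavg,
                            uint32_t next_random)
{
    if (qavg < m->qth_min) {
        m->qcount = -1;
        return DONT_MARK;
    }
    if (qavg >= m->qth_max) {
        m->qcount = -1;
        return HARD_MARK;
    }

    if (++m->qcount) {
        /* Mark once ((qavg - qth_min) >> wlog) * qcount reaches qR. */
        if (!(((qavg - m->qth_min) >> m->wlog) * m->qcount < m->qR)) {
            m->qcount = 0;
            m->qR = next_random;
            return PROB_MARK;
        }
    } else {
        /* First packet between the thresholds: only draw a new qR. */
        m->qR = next_random;
    }
    return DONT_MARK;
}
```

With qth_min = 400, qth_max = 1200, wlog = 2 and a constant qavg of 800, the per-packet increment is (800 - 400) >> 2 = 100. If the first call draws qR = 250, the following packets compare 100, 200 and 300 against 250, so the third of them is the one that gets marked.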

5 ECN marking

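If the ecn or harddrop option is given when configuring red with tc, the kernel sets the flag TC_RED_ECN or TC_RED_HARDDROP respectively. The following two functions test these flags.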

static inline int red_use_ecn(struct red_sched_data *q)
{
    return q->flags & TC_RED_ECN;
}

static inline int red_use_harddrop(struct red_sched_data *q)
{
    return q->flags & TC_RED_HARDDROP;
}

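In the RED enqueue function, if red_action returns RED_PROB_MARK, the ecn flag is set (red_use_ecn), and ECN marking of the packet succeeds, the marking counter prob_mark is incremented and the packet is added to the queue. Otherwise, the drop counter prob_drop is incremented and the packet is dropped.

When the action is RED_HARD_MARK and the harddrop option is set (red_use_harddrop), the packet is dropped. Otherwise, the logic is the same as for RED_PROB_MARK, with the forced_mark and forced_drop counters used instead.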

static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
{
    ...    
    switch (red_action(&q->parms, &q->vars, q->vars.qavg)) {
    case RED_DONT_MARK:
        break; 
                
    case RED_PROB_MARK:
        qdisc_qstats_overlimit(sch);
        if (!red_use_ecn(q) || !INET_ECN_set_ce(skb)) {
            q->stats.prob_drop++;
            goto congestion_drop;
        }       
    
        q->stats.prob_mark++;
        break;
        
    case RED_HARD_MARK: 
        qdisc_qstats_overlimit(sch);
        if (red_use_harddrop(q) || !red_use_ecn(q) ||
            !INET_ECN_set_ce(skb)) {
            q->stats.forced_drop++;
            goto congestion_drop;
        }

The function INET_ECN_set_ce marks the packet. It supports both IPv4 and IPv6.

static inline int INET_ECN_set_ce(struct sk_buff *skb)
{
    switch (skb->protocol) {
    case cpu_to_be16(ETH_P_IP):
        if (skb_network_header(skb) + sizeof(struct iphdr) <=
            skb_tail_pointer(skb))
            return IP_ECN_set_ce(ip_hdr(skb));
        break;

    case cpu_to_be16(ETH_P_IPV6):
        if (skb_network_header(skb) + sizeof(struct ipv6hdr) <=
            skb_tail_pointer(skb))
            return IP6_ECN_set_ce(skb, ipv6_hdr(skb));
        break;
    }

    return 0;
}

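For IPv4, if the packet is ECN-capable, the INET_ECN_CE (0x3) codepoint is set in the tos field of the IP header and the IP header checksum is updated incrementally to match. A packet marked INET_ECN_NOT_ECT is left unchanged and the function returns 0; a packet that already carries CE is left unchanged and the function returns 1.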

static inline int IP_ECN_set_ce(struct iphdr *iph)
{
    u32 check = (__force u32)iph->check;
    u32 ecn = (iph->tos + 1) & INET_ECN_MASK;

    /*
     * After the last operation we have (in binary):
     * INET_ECN_NOT_ECT => 01
     * INET_ECN_ECT_1   => 10
     * INET_ECN_ECT_0   => 11
     * INET_ECN_CE      => 00
     */
    if (!(ecn & 2))
        return !ecn;

    /*
     * The following gives us:
     * INET_ECN_ECT_1 => check += htons(0xFFFD)
     * INET_ECN_ECT_0 => check += htons(0xFFFE)
     */
    check += (__force u16)htons(0xFFFB) + (__force u16)htons(ecn);

    iph->check = (__force __sum16)(check + (check>=0xFFFF));
    iph->tos |= INET_ECN_CE;
    return 1;
}
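The effect of the incremental checksum update can be verified in user space. The sketch below is not the kernel's arithmetic: it works on a raw 20-byte header, uses the generic incremental update from RFC 1624 instead of the kernel's precomputed constants, and the names ipv4_checksum and ipv4_set_ce are hypothetical. It assumes a header without IP options.

```c
#include <stddef.h>
#include <stdint.h>

#define ECN_MASK 0x3
#define ECN_CE   0x3

/*
 * One's-complement checksum over the header bytes as they are.
 * With the checksum field zeroed this yields the value to store;
 * over a header with a valid checksum it yields 0.
 */
static uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 1 if the header carries CE afterwards, 0 for Not-ECT. */
static int ipv4_set_ce(uint8_t *hdr)
{
    uint8_t old_tos = hdr[1];
    uint8_t ecn = old_tos & ECN_MASK;

    if (ecn == 0)
        return 0;                  /* Not-ECT: must not be marked */
    if (ecn == ECN_CE)
        return 1;                  /* already marked */

    uint8_t  new_tos  = old_tos | ECN_CE;
    uint16_t old_word = (uint16_t)(hdr[0] << 8 | old_tos);
    uint16_t new_word = (uint16_t)(hdr[0] << 8 | new_tos);
    uint16_t check    = (uint16_t)(hdr[10] << 8 | hdr[11]);

    /* RFC 1624: HC' = ~(~HC + ~m + m') in one's-complement arithmetic. */
    uint32_t sum = (uint32_t)(uint16_t)~check + (uint16_t)~old_word + new_word;

    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    check = (uint16_t)~sum;

    hdr[1]  = new_tos;
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)(check & 0xFF);
    return 1;
}
```

After ipv4_set_ce succeeds on a header with a valid checksum, running ipv4_checksum over the full header returns 0 again, which shows that only the affected 16-bit word had to be accounted for.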

6 RED dequeue

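After a packet has been taken from the child queue, the queue length is decremented by one. If the child queue returns no packet and the queue is not already idle, the idle state is entered and the idle-start timestamp is set.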

static struct sk_buff *red_dequeue(struct Qdisc *sch)
{
    struct sk_buff *skb;
    struct red_sched_data *q = qdisc_priv(sch);
    struct Qdisc *child = q->qdisc;

    skb = child->dequeue(child);
    if (skb) {
        qdisc_bstats_update(sch, skb);
        qdisc_qstats_backlog_dec(sch, skb);
        sch->q.qlen--;
    } else {
        if (!red_is_idling(&q->vars))
            red_start_of_idle_period(&q->vars);
    }
    return skb;
}
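The idle bookkeeping shared by dequeue and enqueue can be modelled in a few lines. This is a sketch under the assumption that a zero timestamp means "not idling"; the struct idle_state and the helper names are hypothetical, and the current time is passed in rather than read from a clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical idle state: 0 is assumed to mean "not idling". */
struct idle_state {
    int64_t idle_start_us;
};

static bool is_idling(const struct idle_state *s)
{
    return s->idle_start_us != 0;
}

static void start_idle(struct idle_state *s, int64_t now_us)
{
    s->idle_start_us = now_us;
}

static void end_idle(struct idle_state *s)
{
    s->idle_start_us = 0;
}

/* Dequeue-side rule: enter the idle state when no packet was returned. */
static void on_dequeue(struct idle_state *s, bool got_packet, int64_t now_us)
{
    if (!got_packet && !is_idling(s))
        start_idle(s, now_us);
}
```

Repeated empty dequeues keep the first timestamp, so the idle time measured at the next enqueue covers the whole idle period. The enqueue side then calls the equivalent of end_idle after qavg has been recomputed.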

Kernel version: 5.0
