bug in tulip_rx? patch

From: Wolfgang Walter (wolfgang.walter@stusta.mhn.de)
Date: Tue Nov 23 1999 - 20:30:24 EST


Hi Donald,

First (as I post it on linux-kernel, too) again a short description of my
problem:

Since we use 2.2.14pre4 instead of 2.0.38 we observe random hangs of the
network interface which is a tulip card.

I tried older tulip-drivers, but this happens with all of them.

Now I had a closer look at the code and found a situation which will cause
a hang (see my older mail on tulip-bug).

I now could catch the driver this state (switched suspend mode):

tulip-diag.c:v1.19 10/2/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
Index #1: Found a Digital DC21041 Tulip adapter at 0xe000.
Digital DC21041 Tulip chip registers at 0xe000:
  ffe08000 ffffffff ffffffff 002f5810 002f5a10 fc680000 fffe2002 ffffebef
  fffe0001 ffff4bf8 ffffffff fffe0000 000001c8 ffffef05 ffffff3f ffff0008
 Port selection is half-duplex.
 Transmit started, Receive started, half-duplex.
  The Rx process state is 'Suspended -- no Rx buffers'.
  The Tx process state is 'Idle'.
  The transmit unit is set to store-and-forward.
  The NWay status register is 000001c8.
PCI Subsystem IDs, vendor 1186, device 0100.
CardBus Information Structure at offset 00000000.
Ethernet MAC Station Address 00:80:C8:E6:1C:2F.
EEPROM transceiver/media description for the Digital DC21041 Tulip chip.
Leaf node at offset 30, default media type 0800 (Autosense).
 3 transceiver description blocks:
  21041 media index 00 (10baseT).
  21041 media index 04 (10baseT-Full Duplex).
  21041 media index 01 (10base2).
  Internal autonegotiation state is 'Autonegotiation disabled'.

The driver does not recover from this state.

I attached a patch which seems to work. But it runs only since some hours.
I post it here so that you may have a look at it. Also, there seem to be people
who have the same problem, they may try it out.

Wolfgang Walter

--- 2.2.14pre4/drivers/net/tulip.c Fri Nov 5 18:02:35 1999
+++ 2.2.14pre4-m/drivers/net/tulip.c Wed Nov 24 02:04:13 1999
@@ -509,6 +509,7 @@
         int interrupt; /* In-interrupt flag. */
         unsigned int cur_rx, cur_tx; /* The next free ring entry */
         unsigned int dirty_rx, dirty_tx; /* The ring entries to be free()ed. */
+ unsigned int nr_skbs_rx; /* number of skbs */
         unsigned int tx_full:1; /* The Tx queue is full. */
         unsigned int full_duplex:1; /* Full-duplex operation requested. */
         unsigned int full_duplex_lock:1;
@@ -2484,6 +2485,7 @@
         tp->tx_full = 0;
         tp->cur_rx = tp->cur_tx = 0;
         tp->dirty_rx = tp->dirty_tx = 0;
+ tp->nr_skbs_rx = 0;
 
         for (i = 0; i < RX_RING_SIZE; i++) {
                 tp->rx_ring[i].status = 0x00000000;
@@ -2503,6 +2505,7 @@
                 tp->rx_skbuff[i] = skb;
                 if (skb == NULL)
                         break;
+ tp->nr_skbs_rx++;
                 skb->dev = dev; /* Mark as being used by this device. */
                 tp->rx_ring[i].status = cpu_to_le32(DescOwned); /* Owned by Tulip chip */
                 tp->rx_ring[i].buffer1 = virt_to_le32desc(skb->tail);
@@ -2607,7 +2610,11 @@
                 if ((csr5 & (NormalIntr|AbnormalIntr)) == 0)
                         break;
 
- if (csr5 & (RxIntr | RxNoBuf))
+ /* tp->nr_skbs_rx == 0 can only happen if we never succeeded to allocate at least
+ * one skb (which we usually do at initialisation time.
+ * We then hope do do so now or any later non-RD/RU interrupt
+ */
+ if (csr5 & (RxIntr | RxNoBuf) || tp->nr_skbs_rx == 0)
                         work_budget -= tulip_rx(dev);
 
                 if (csr5 & (TxNoBuf | TxDied | TxIntr)) {
@@ -2797,7 +2804,7 @@
 #endif
                         /* Check if the packet is long enough to accept without copying
                            to a minimally-sized skbuff. */
- if (pkt_len < rx_copybreak
+ if ((pkt_len < rx_copybreak || tp->nr_skbs_rx < 2)
                                 && (skb = dev_alloc_skb(pkt_len + 2)) != NULL) {
                                 skb->dev = dev;
                                 skb_reserve(skb, 2); /* 16 byte align the IP header */
@@ -2809,9 +2816,10 @@
                                            pkt_len);
 #endif
                                 work_done++;
- } else { /* Pass up the skb already on the Rx ring. */
+ } else if (tp->nr_skbs_rx > 1) { /* Pass up the skb already on the Rx ring. */
                                 char *temp = skb_put(skb = tp->rx_skbuff[entry], pkt_len);
                                 tp->rx_skbuff[entry] = NULL;
+ tp->nr_skbs_rx--;
 #ifndef final_version
                                 if (le32desc_to_virt(tp->rx_ring[entry].buffer1) != temp)
                                         printk(KERN_ERR "%s: Internal fault: The skbuff addresses "
@@ -2820,6 +2828,9 @@
                                                    le32desc_to_virt(tp->rx_ring[entry].buffer1),
                                                    skb->head, temp);
 #endif
+ } else {
+ /* we have only one skb; we must keep it: drop the packet */
+ tp->stats.rx_dropped++;
                         }
                         skb->protocol = eth_type_trans(skb, dev);
                         netif_rx(skb);
@@ -2840,11 +2851,55 @@
                         skb = tp->rx_skbuff[entry] = dev_alloc_skb(PKT_BUF_SZ);
                         if (skb == NULL)
                                 break;
+ tp->nr_skbs_rx++;
                         skb->dev = dev; /* Mark as being used by this device. */
                         tp->rx_ring[entry].buffer1 = virt_to_le32desc(skb->tail);
                         work_done++;
                 }
                 tp->rx_ring[entry].status = cpu_to_le32(DescOwned);
+ }
+
+ /*
+ * We must catch the situation:
+ * the card run out of RX-buffers when this interrupt was triggered (but we have a skb available)
+ *
+ * This is the case, if
+ * tp->nr_skbs_rx > 0
+ * and
+ * all buffers are dirty
+ * <=> tp->dirty_rx + RX_RING_SIZE == tp->cur_rx
+ * <=> tp->cur_rx has no skb associated
+ *
+ * Why?
+ * =>
+ * If at this stage all buffers are dirty this means that
+ * tp->dirty_rx + RX_RING_SIZE == tp->cur_rx
+ * must hold. This means that the above refill did not change tp->dirty_rx at all
+ * which means that the associated skb must be null. As this is the same in the ring as
+ * tp->cur_rx represents, there is no skb associated with tp->cur_rx.
+ *
+ * <=
+ * If tp->cur_rx has no skb associated the rx-loop must be terminated with
+ * rx_work_limit<0 (tp->rx_ring[entry].status & cpu_to_le32(DescOwned) can't be
+ * true if no skb is associated). This means that
+ * tp->dirty_rx + RX_RING_SIZE == tp->cur_rx
+ * holds which means that all buffers are dirty
+ *
+ */
+ entry = tp->cur_rx % RX_RING_SIZE;
+ if (tp->nr_skbs_rx > 0 && tp->rx_skbuff[entry] == NULL) {
+ int i;
+ /* search a free buffer */
+ for (i = 0; i < RX_RING_SIZE; i++) {
+ if (tp->rx_skbuff[i] != NULL) {
+ tp->rx_skbuff[entry] = tp->rx_skbuff[i];
+ tp->rx_ring[entry].buffer1 = tp->rx_ring[i].buffer1;
+ tp->rx_skbuff[i] = NULL;
+ tp->rx_ring[entry].status = cpu_to_le32(DescOwned);
+ tp->dirty_rx++;
+ break;
+ }
+ }
         }
 
         return work_done;



This archive was generated by hypermail 2b29 : Wed Jan 12 2000 - 12:00:16 EST