[PATCH v2] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt

Discussion:

(too old to reply)

Douglas Anderson

2020-03-17 20:37:06 UTC

Every once in a while (like once in a few months on a device) people
have seen warnings on devices using spi-geni-qcom like:

irq ...: nobody cared (try booting with the "irqpoll" option)

...where the interrupt number listed matches up with "spi_geni" in
/proc/interrupts.

You can get "nobody cared" if the interrupt handler returns IRQ_NONE.
Once you get into this state the driver will just stop working.

Auditing the code makes me believe that it's probably not so good
checking "cur_mcmd" in the interrupt handler not under spinlock.
Let's move it to be under spinlock. Looking more closely at the code,
it looks as if there are some other places that could be under
spinlock, so let's add. It looks as if the original code was assuming
that by the time that the interrupt handler got called that the write
to "cur_mcmd" (to make it not CMD_NONE anymore) would make it to the
processor handling the interrupt. Perhaps with weakly ordered memory
this sometimes (though very rarely) happened.

Let's also add a warning (with the IRQ status) in the case that we
ever end up getting an interrupt when we think we shouldn't. This
would give us a better chance to debug if this patch doesn't help the
issue. We'll also try our best to clear the interrupt to hopefully
get us out of this state.

Patch is marked "speculative" because I have no way to reproduce this
problem, so I'm just hoping this fixes it. Weakly ordered memory
makes my brain hurt.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <***@chromium.org>
---

Changes in v2:
- Detect true spurious interrupt.
- Still return IRQ_NONE for state machine mismatch, but print warn.

drivers/spi/spi-geni-qcom.c | 35 ++++++++++++++++++++++++++++++-----
1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index c3972424af71..92e51ccb427f 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
struct geni_se *se = &mas->se;
unsigned long time_left;

- reinit_completion(&mas->xfer_done);
pm_runtime_get_sync(mas->dev);
if (!(slv->mode & SPI_CS_HIGH))
set_flag = !set_flag;

+ spin_lock_irq(&mas->lock);
+ reinit_completion(&mas->xfer_done);
+
mas->cur_mcmd = CMD_CS;
if (set_flag)
geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);
else
geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0);
+ spin_unlock_irq(&mas->lock);

time_left = wait_for_completion_timeout(&mas->xfer_done, HZ);
if (!time_left)
@@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
u32 spi_tx_cfg, len;
struct geni_se *se = &mas->se;

+ spin_lock_irq(&mas->lock);
+
spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG);
if (xfer->bits_per_word != mas->cur_bits_per_word) {
spi_setup_word_len(mas, mode, xfer->bits_per_word);
@@ -376,6 +381,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
*/
if (m_cmd & SPI_TX_ONLY)
writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
+
+ spin_unlock_irq(&mas->lock);
}

static int spi_geni_transfer_one(struct spi_master *spi,
@@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
struct geni_se *se = &mas->se;
u32 m_irq;
unsigned long flags;
-
- if (mas->cur_mcmd == CMD_NONE)
- return IRQ_NONE;
+ irqreturn_t ret = IRQ_HANDLED;

spin_lock_irqsave(&mas->lock, flags);
m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

+ /* Check for spurious interrupt */
+ if (!m_irq) {
+ ret = IRQ_NONE;
+ goto exit;
+ }
+
+ /*
+ * If we got a real interrupt but software state machine thinks
+ * we were idle then give a warning. We'll do our best to clear
+ * the interrupt but we'll still return IRQ_NONE. If this keeps
+ * happening the system will eventually disable us.
+ */
+ if (mas->cur_mcmd == CMD_NONE) {
+ pr_warn("Unexpected SPI interrupt: %#010x\n", m_irq);
+ ret = IRQ_NONE;
+ goto exit;
+ }
+
if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN))
geni_spi_handle_rx(mas);

@@ -523,9 +546,11 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
complete(&mas->xfer_done);
}

+exit:
writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
spin_unlock_irqrestore(&mas->lock, flags);
- return IRQ_HANDLED;
+
+ return ret;
}

static int spi_geni_probe(struct platform_device *pdev)

--
2.25.1.481.gfbce0eb801-goog

Stephen Boyd

2020-03-17 21:36:05 UTC

Permalink

Quoting Douglas Anderson (2020-03-17 13:37:06)

Post by Douglas Anderson
Every once in a while (like once in a few months on a device) people
irq ...: nobody cared (try booting with the "irqpoll" option)
...where the interrupt number listed matches up with "spi_geni" in
/proc/interrupts.
You can get "nobody cared" if the interrupt handler returns IRQ_NONE.
Once you get into this state the driver will just stop working.
Auditing the code makes me believe that it's probably not so good
checking "cur_mcmd" in the interrupt handler not under spinlock.
Let's move it to be under spinlock. Looking more closely at the code,
it looks as if there are some other places that could be under
spinlock, so let's add. It looks as if the original code was assuming
that by the time that the interrupt handler got called that the write
to "cur_mcmd" (to make it not CMD_NONE anymore) would make it to the
processor handling the interrupt. Perhaps with weakly ordered memory
this sometimes (though very rarely) happened.
Let's also add a warning (with the IRQ status) in the case that we
ever end up getting an interrupt when we think we shouldn't. This
would give us a better chance to debug if this patch doesn't help the
issue. We'll also try our best to clear the interrupt to hopefully
get us out of this state.
Patch is marked "speculative" because I have no way to reproduce this
problem, so I'm just hoping this fixes it. Weakly ordered memory
makes my brain hurt.

It could be that. It could also be the poor design of geni_se_init() and
how it enables many interrupts that this driver doesn't look to handle?
Why do we allow the wrapper to enable all those bits in
M_COMMON_GENI_M_IRQ_EN and S_COMMON_GENI_S_IRQ_EN if nobody is going to
handle them?

Post by Douglas Anderson
Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
---
- Detect true spurious interrupt.
- Still return IRQ_NONE for state machine mismatch, but print warn.
drivers/spi/spi-geni-qcom.c | 35 ++++++++++++++++++++++++++++++-----
1 file changed, 30 insertions(+), 5 deletions(-)
diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index c3972424af71..92e51ccb427f 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
struct geni_se *se = &mas->se;
unsigned long time_left;
- reinit_completion(&mas->xfer_done);
pm_runtime_get_sync(mas->dev);
if (!(slv->mode & SPI_CS_HIGH))
set_flag = !set_flag;
+ spin_lock_irq(&mas->lock);

Why is this spin_lock_irq() vs. spin_lock_irqsave()? This isn't possible
to be called from somewhere that hasn't changed irq flags?

Post by Douglas Anderson
+ reinit_completion(&mas->xfer_done);
+
mas->cur_mcmd = CMD_CS;
if (set_flag)
geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);
else
geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0);
+ spin_unlock_irq(&mas->lock);

This will force on interrupts if they were masked.

Post by Douglas Anderson
time_left = wait_for_completion_timeout(&mas->xfer_done, HZ);
if (!time_left)
@@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
u32 spi_tx_cfg, len;
struct geni_se *se = &mas->se;
+ spin_lock_irq(&mas->lock);
+
spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG);
if (xfer->bits_per_word != mas->cur_bits_per_word) {
spi_setup_word_len(mas, mode, xfer->bits_per_word);
@@ -376,6 +381,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
*/
if (m_cmd & SPI_TX_ONLY)
writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
+
+ spin_unlock_irq(&mas->lock);
}
static int spi_geni_transfer_one(struct spi_master *spi,
@@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
struct geni_se *se = &mas->se;
u32 m_irq;
unsigned long flags;
-
- if (mas->cur_mcmd == CMD_NONE)
- return IRQ_NONE;
+ irqreturn_t ret = IRQ_HANDLED;
spin_lock_irqsave(&mas->lock, flags);
m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

Does this read need to be inside the lock?

Post by Douglas Anderson
+ /* Check for spurious interrupt */
+ if (!m_irq) {
+ ret = IRQ_NONE;
+ goto exit;

I ask because it could be simplified by reading the status and then
immediately returning IRQ_NONE if no bits are set without having to do
the lock/unlock dance. And does writing 0 to the irq clear register do
anything?

Post by Douglas Anderson
+ }
+
+ /*
+ * If we got a real interrupt but software state machine thinks
+ * we were idle then give a warning. We'll do our best to clear
+ * the interrupt but we'll still return IRQ_NONE. If this keeps
+ * happening the system will eventually disable us.
+ */
+ if (mas->cur_mcmd == CMD_NONE) {
+ pr_warn("Unexpected SPI interrupt: %#010x\n", m_irq);
+ ret = IRQ_NONE;
+ goto exit;
+ }
+
if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN))
geni_spi_handle_rx(mas);
@@ -523,9 +546,11 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
complete(&mas->xfer_done);
}
writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
spin_unlock_irqrestore(&mas->lock, flags);
- return IRQ_HANDLED;
+
+ return ret;
}
static int spi_geni_probe(struct platform_device *pdev)

Doug Anderson

2020-03-17 22:08:45 UTC

Permalink

Hi,

Post by Stephen Boyd

Post by Douglas Anderson
Patch is marked "speculative" because I have no way to reproduce this
problem, so I'm just hoping this fixes it. Weakly ordered memory
makes my brain hurt.

It is possible it's related to an interrupt that we don't handle. However:

* IMO having the locking in place is safer anyway. At some point I
read that advice that trying to reason about weakly ordered memory was
simply too hard for the average person (even the average kernel
developer). In 99% of the cases you could just use a lock so it's
super clear and the performance difference is near zero.

* Most of the cases I saw of the "nobody cared" for geni-spi was on a
mostly idle system (presumably still doing periodic SPI transactions
to the EC, though). It seems weird that one of these other interrupts
would suddenly fire. It seems more likely that we just happened to
win a race of some sort.

If nothing else it will suddenly become very obvious after my patch
lands because I'll print out the status.

That all being said if someone wants to submit a patch to clean up
which interrupts are enabled I'd support it.

Post by Stephen Boyd

Post by Douglas Anderson
@@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
struct geni_se *se = &mas->se;
unsigned long time_left;
- reinit_completion(&mas->xfer_done);
pm_runtime_get_sync(mas->dev);
if (!(slv->mode & SPI_CS_HIGH))
set_flag = !set_flag;
+ spin_lock_irq(&mas->lock);

Why is this spin_lock_irq() vs. spin_lock_irqsave()? This isn't possible
to be called from somewhere that hasn't changed irq flags?

See below.

Post by Stephen Boyd

This will force on interrupts if they were masked.

I'll change it if you want, but in this function there is already a
call to "wait_for_completion_timeout". That's not gonna be too happy
if this function is ever called with interrupts already masked. Also
in this function is pm_runtime_get_sync() which in many cases will
sleep (I think we'll end up in geni_se_clks_on() which calls
clk_bulk_prepare_enable()).

In cases where you know for sure that interrupts aren't masked,
spin_lock_irq() and spin_unlock_irq() are fine and that's what they're
for, no?

Post by Stephen Boyd

...and just to answer the same question for here: setup_fifo_xfer() is
called from spi_geni_transfer_one() which is our "transfer_one"
function. We don't happen to block anywhere in these functions, but
I'm nearly certain you are allowed to block in them. We actually
return a positive number to indicate to the SPI core that we're not
doing the blocking ourselves but since the SPI core can't know we were
going to do that it has to assume we might block.

Post by Stephen Boyd

Post by Douglas Anderson
@@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
struct geni_se *se = &mas->se;
u32 m_irq;
unsigned long flags;
-
- if (mas->cur_mcmd == CMD_NONE)
- return IRQ_NONE;
+ irqreturn_t ret = IRQ_HANDLED;
spin_lock_irqsave(&mas->lock, flags);

Ironically the above could probably just be "spin_lock" since this is
our interrupt handler. ;-) I'll just leave it alone though since
what's there now doesn't hurt.

Post by Stephen Boyd

Post by Douglas Anderson
m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

Does this read need to be inside the lock?

Probably not. Discussion below.

Post by Stephen Boyd

Post by Douglas Anderson
+ /* Check for spurious interrupt */
+ if (!m_irq) {
+ ret = IRQ_NONE;
+ goto exit;

Sure. I'll move it if you want. It felt nicer to just keep the whole
thing under lock so I didn't have to think about whether it mattered.
...and anyone else looking at it didn't need to think about if it
mattered, too. It it is very easy to say that it doesn't _hurt_ to
have it under lock other than having one extra memory read under lock.
...and presumably the case where the optimization matters is
incredibly rare (a spurious interrupt) and we just spent a whole lot
of cycles calling the interrupt handler to begin with for this
spurious interrupt.

I would have a hard time believing that a write of 0 to a "write 1 to
clear" register would have any impact. It would be a pretty bad
hardware design...

So I guess in summary, I'm not planning to spin for any of these
things unless you really insist or you say I'm wrong about something
above or someone else says my opinions are the wrong opinions.

-Doug