transfersite failures

Troy · July 29, 2024, 12:33pm

Password change emails don’t go out for every account, but I received one for a test account and so did quite a few other people, so I believe that’s enough of a concern to ensure suppression during a migration.

Migrating ↔ Valid Nameservers:
Nobody needs a 17 year history, just a simple answer. Turns out disabling the proxy link on the source server pre-transfer is a bad idea, thanks for the heads up.

duplicated DNS records:
Yes, it’s reliably happening on more than 50% of account transfers.
I have to log in, delete the duplicate record and resubmit the transfer.

~]# cpcmd -d domain.com dns:get-records 'roundcube' A domain.com
-
  name: roundcube.domain.com.
  subdomain: roundcube
  domain: domain.com
  class: IN
  ttl: 14400
  parameter: 167.253.62.1
  rr: A
-
  name: roundcube.domain.com.
  subdomain: roundcube
  domain: domain.com
  class: IN
  ttl: 14400
  parameter: 54.39.154.240
  rr: A

The biggest outstanding issue from my end is this issue with duplicated DNS records. There is absolutely no reason why any account would have two of those records to begin with and no way to prove that without doing a before and after dump of DNS records pre-transfer.
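The before-and-after comparison mentioned above could be sketched roughly like this. It is a minimal Python sketch, not part of the panel: it assumes the `dns:get-records` output always uses the dash-separated `key: value` layout shown in this post, and the names `parse_records` and `added_records` are made up for illustration.

```python
def parse_records(dump):
    """Parse dash-separated 'key: value' blocks into a set of
    (name, rr, parameter) tuples. Assumes the layout shown in the post."""
    records = set()
    current = {}
    # The extra "-" at the end flushes the final block.
    for line in dump.splitlines() + ["-"]:
        stripped = line.strip()
        if stripped == "-":
            if "name" in current:
                records.add((current.get("name"),
                             current.get("rr"),
                             current.get("parameter")))
            current = {}
        elif ":" in stripped:
            # Split on the first colon only, so IPv6 values stay intact.
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return records


def added_records(before_dump, after_dump):
    """Records present after the transfer that were not there before."""
    return parse_records(after_dump) - parse_records(before_dump)


# Sample dumps built from the output in this post (trimmed to three fields).
BEFORE = """-
  name: roundcube.domain.com.
  parameter: 54.39.154.240
  rr: A
"""

AFTER = """-
  name: roundcube.domain.com.
  parameter: 167.253.62.1
  rr: A
-
  name: roundcube.domain.com.
  parameter: 54.39.154.240
  rr: A
"""

for record in sorted(added_records(BEFORE, AFTER)):
    print(record)
```

Saving one dump per domain before stage 0 and another before stage 1 would show at which point the second A record appears.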

I need a fix for this before I can do any more migrations because fixing DNS zones one at a time is a real time suck.

msaladna · July 29, 2024, 4:23pm

I would need to be able to reliably reproduce the problem to determine what’s happening. Password reset notices are suppressed as indicated above, so what you’re seeing is an aberration that further information is required to evaluate.

From the backtrace earlier,

It’s looking for 54.39.163.98 but only 54.39.154.240 + 167.253.62.1 are present. Is this for an addon domain or the primary domain? Before migrating, is the IP for webmail OK? After the initial migration (stage 0), do we have these bogus records? If you re-run the second leg of migration on source, does it trigger the same backtrace?

./bin/scripts/transfersite.php --stage=1 --force -s NEW.SERVER.NAME DOMAIN.COM

I’m looking for things that can be reliably reproduced so I have personal assurance it can be repeated unlimited times until properly resolved. Any clarifying information helps me isolate test cases.

Troy · July 29, 2024, 4:30pm

I have nothing to share / prove with the password reset emails, only that they happened and there were quite a few.

As for the duplicated DNS records… From my examples, I’m sharing from two servers.

OLD → NEW
54.39.163.98 → 167.253.62.0
54.39.154.240 → 167.253.62.1

There is no reason why there would be crossover so I probably shared examples from each server.

This happened a week ago or so with my last couple of migrations and it’s happening reliably on this new batch of migrations and now I dread starting another round next week because of this.

msaladna · July 30, 2024, 6:28pm

Is there any pattern to the generation? Does it happen to the primary user? Does it happen to secondary users? Does it happen to both? I cannot reproduce, so it would be difficult to carve out time without added information.

DNS records are fetched from the remote server, then updated. Error generated is because PowerDNS no longer sees this record despite reporting it - that’s strange.

With the latest release from today, v3.2.44, run a migration on a site or two. Check DNS before the second stage migration occurs to ensure there is only 1 A record for each of the 3 webmail subdomains. Additionally, records will now be correctly updated even if the zone contains multiple matching records.

Troy · July 30, 2024, 11:24pm

Ok, I’ll let you know.

While you’re making changes to default DNS stuff: if the _apnscp_uuid record is missing, instead of reporting that it doesn’t exist, can it be created? I’ve noticed some people trim their DNS zone down to the bare minimum and delete things they don’t understand.

msaladna · July 31, 2024, 1:52am

There are two types of people: those who don’t mind the record and those who absolutely do mind as it discloses potentially sensitive information about the platform or related domains. UUID is used primarily for multi-server arbitration if the zone appears on a variety of servers. If absent - and how I used to manage domain removal - you can resolve by comparing your domain-server relationship in cp-proxy.

Likewise, if you would like to add it back, that may be done with dns:reset or Toolbox > Reset DNS to default.

Troy · August 7, 2024, 12:33pm

Do you think this issue with duplicating / updating of mail, horde and roundcube records could have anything to do with a custom DNS template? (I’ve removed the horde record but it was there).

config/custom/resources/templates/dns/email.blade.php

{!! ltrim(implode('.', [$subdomain, $zone]), '.') !!}. {!! $ttl !!} IN MX 10 mail.{{ $zone }}.
{!! ltrim(implode('.', [$subdomain, $zone]), '.') !!}. {!! $ttl !!} IN MX 20 mail.{{ $zone }}.
{!! ltrim(implode('.', [$subdomain, $zone]), '.') !!}. {!! $ttl !!} IN TXT "v=spf1 a mx include:relay.mailchannels.net ?all"
@foreach($ips as $ip)
@php $rr = false === strpos($ip, ':') ? 'A' : 'AAAA'; @endphp
@foreach(['mail','roundcube'] as $mailsub)
{!! ltrim(implode('.', [$mailsub, $zone]), '.') !!}. {!! $ttl !!} IN {!! $rr !!} {!! $ip !!}
@endforeach
@endforeach
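To make it easier to see what the two nested loops in this template emit, here is a rough Python equivalent of just that part. It is a sketch for illustration and not the panel's renderer; `render_mail_records` is a made-up name. Whether more than one address is ever passed in `$ips` during a transfer is not established in this thread, so the two-address call below only shows what the loops would produce in that case.

```python
def render_mail_records(zone, ttl, ips, subdomains=("mail", "roundcube")):
    """Mirror the @foreach loops: one record per (ip, subdomain) pair."""
    lines = []
    for ip in ips:
        # Same test as the template: a colon in the address means IPv6.
        rr = "AAAA" if ":" in ip else "A"
        for sub in subdomains:
            lines.append(f"{sub}.{zone}. {ttl} IN {rr} {ip}")
    return lines


# One address: a single A record per webmail subdomain.
single = render_mail_records("domain.com", 14400, ["167.253.62.1"])

# Two addresses: each subdomain gets two A records.
double = render_mail_records("domain.com", 14400,
                             ["54.39.154.240", "167.253.62.1"])

for line in double:
    print(line)
```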

msaladna · August 8, 2024, 11:02am

These are the present reference templates for core DNS as well as email DNS.

My previous advice holds,

Troy · August 15, 2024, 1:55am

Ran a few migrations with no issues, then proceeded to migrate 320 accounts with roughly 550 domains and ended up with a 15-20% failure rate due to duplicated DNS records for mail and/or roundcube only. After deleting one of the records manually, I could redo the transfer with most of those succeeding and only a few having duplicated records again.

msaladna · August 15, 2024, 4:20pm

These ad hoc statistics don’t provide anything useful other than you’re presently experiencing the issue, which is what we already know.

I’d like to see a sample error backtrace from the latest transfer that failed. Given changes in the PowerDNS code, this backtrace allows me to indirectly see what version of code is running by line offsets as well as file location (module overrides do happen).

It may be worthwhile to check the database schema directly for duplicate records for domains that may trigger the issue:

SELECT r2.name, r2.type, records.content FROM (SELECT name, type FROM records GROUP BY name,type HAVING COUNT(name) > 1) r2 JOIN records ON(r2.name = records.name AND r2.type = records.type);
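The query above can be tried against a toy table to see what it reports. This is a minimal sketch under stated assumptions: it uses an in-memory SQLite database rather than the real PowerDNS backend, the `records` table is reduced to the columns the query touches, and an `ORDER BY` is added only to make the output stable. The sample rows reuse the addresses from earlier in the thread.

```python
import sqlite3

# Same query as in the post, plus ORDER BY for deterministic output.
DUPLICATE_QUERY = """
SELECT r2.name, r2.type, records.content
FROM (SELECT name, type FROM records
      GROUP BY name, type HAVING COUNT(name) > 1) r2
JOIN records ON (r2.name = records.name AND r2.type = records.type)
ORDER BY r2.name, records.content
"""


def find_duplicates(conn):
    """Return every record whose (name, type) pair occurs more than once."""
    return conn.execute(DUPLICATE_QUERY).fetchall()


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real table (assumption: only these columns matter).
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, type TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO records (name, type, content) VALUES (?, ?, ?)",
    [
        ("roundcube.domain.com", "A", "54.39.154.240"),
        ("roundcube.domain.com", "A", "167.253.62.1"),
        ("mail.domain.com", "A", "167.253.62.1"),
    ],
)

duplicates = find_duplicates(conn)
for row in duplicates:
    print(row)
```

Note that the query flags any name and type pair with more than one row, so intentional multi-address records show up alongside the unwanted ones.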

Troy · August 15, 2024, 5:03pm

Well, I tried what you said, performed a couple of transfers and no issues.

Maybe you can add the debug code to the transfer site script, then we can enable debugging when doing transfers and it can log the information you require to a file that we can provide to you when things inevitably fail.

It’s easier for me to just fix the DNS zones as transfers fail than it is to jump through all the steps you want and then hope that’s enough before your next reply a day or two later. If you can’t fix it based on everything in this thread and others, then fine, I’ll deal with it until I’m done upgrading to Rocky 8.

msaladna · August 15, 2024, 5:06pm

env DEBUG=1 would accomplish this. I need more data points to reliably track this issue down. As I cannot reproduce directly, I’d need more legwork from you to track this down. Right now the issue is localized to you; I’ve not heard of this issue affecting others who have migrated. Perhaps this is linked back to your Galera cluster? One thing to consider, when using AXFR/NOTIFY in DNS, these negative caches get purged. With Galera replicating to a backend, I don’t believe slaves receive a NOTIFY event to purge the negative DNS cache.

It’s hard to know without further information, so that’s why I’m asking these questions. Do you have duplicate records present for these affected webmail addresses? I don’t know, which is why I presented the question.

If it were cut and dried I wouldn’t be wasting our time asking these questions.

Troy · August 16, 2024, 12:50pm

I changed the pdns server configs to set the negcache TTL very low. This may have solved any negative cache issues, but the duplicated records are still a thing, and not for every account, so I’m unsure what the cause is. When I rebuild my DNS cluster, I’ll probably go the AXFR/NOTIFY route, but if I remember correctly it wasn’t practical at the time due to how deleting zones worked or something. It was a long time ago and there was lots of discussion about it when it was very early in the integration process.

As for doing an env DEBUG=1 during transfers, that’s great if I know in advance an account is going to fail, but doing that with a --all is impractical. Again, if there was a --debug=true option with transfersite that created a site by site log file in a /usr/local/apnscp/storage/logs/migration/20240816/site124-stage0.log or something, that would make this part easier.

msaladna · August 16, 2024, 3:40pm

Do it the old fashioned way:

cd /home/virtual
for site in site* ; do 
    /usr/local/apnscp/bin/scripts/transfersite.php -s newserver $site > $site/fst/xfer.log 2>&1
done

Troy · August 16, 2024, 6:26pm

SELECT r2.name, r2.type, records.content FROM (SELECT name, type FROM records WHERE type = 'A' AND name LIKE '%roundcube%' GROUP BY name,type HAVING COUNT(name) > 1) r2 JOIN records ON(r2.name = records.name AND r2.type = records.type);

Returns 0

SELECT r2.name, r2.type, records.content FROM (SELECT name, type FROM records WHERE type = 'A' AND name LIKE '%mail%' GROUP BY name,type HAVING COUNT(name) > 1) r2 JOIN records ON(r2.name = records.name AND r2.type = records.type);

Returns 4 records

Just type = A returns a bunch of results but most of them are for 3rd party services like wix / squarespace and have NEVER been an issue with migrations / transfers. It’s only ever been A records and only ever been roundcube, horde, mail and very rarely (like a couple times) www.

I’ll start a new migration with the old fashioned way and see what happens.

What do you want from me?

Troy · August 16, 2024, 7:13pm

Ok, doing a small batch, using the following simple loop script.

#!/bin/bash

SERVER=new.server.com

cd /home/virtual
loop=1
while [ $loop -le 15 ];
do
        SITENUM=site$loop
        if [[ -d "$SITENUM" ]]; then
                echo "Beginning transfer of $SITENUM - $(cpcmd admin:get_meta_from_site "$SITENUM" siteinfo | grep domain | awk '{print $2}') to $SERVER"
                env DEBUG=1 /usr/local/apnscp/bin/scripts/transfersite.php --server="$SERVER" "$SITENUM" >> "$SITENUM/fst/xfer.log" 2>&1
        else
                echo "/home/virtual/$SITENUM doesn't exist, skipping"
        fi
        ((loop=loop+1))
done
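For the dated, per-site log layout floated earlier in the thread, a rough Python variant of the same loop might look like this. It is a sketch under assumptions: the log directory layout is only the one proposed in this thread, not something the panel creates; the function names are made up; and it relies only on the `--server=`, `--stage=` and `DEBUG=1` usage already shown here. Whether the script's exit status reflects a failed transfer is not confirmed.

```python
import os
import subprocess
from datetime import date
from pathlib import Path

TRANSFERSITE = "/usr/local/apnscp/bin/scripts/transfersite.php"


def build_command(server, site, stage=None):
    """Build the transfersite argument list using flags seen in this thread."""
    cmd = [TRANSFERSITE, f"--server={server}"]
    if stage is not None:
        cmd.append(f"--stage={stage}")
    cmd.append(site)
    return cmd


def log_path(log_root, site, day, stage=None):
    """Hypothetical layout: <log_root>/<YYYYMMDD>/<site>[-stageN].log"""
    suffix = f"-stage{stage}" if stage is not None else ""
    return Path(log_root) / day.strftime("%Y%m%d") / f"{site}{suffix}.log"


def run_batch(server, sites, log_root, stage=None):
    """Run each transfer with DEBUG=1, appending all output to its own log."""
    env = dict(os.environ, DEBUG="1")
    results = {}
    for site in sites:
        path = log_path(log_root, site, date.today(), stage)
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "ab") as log:
            proc = subprocess.run(
                build_command(server, site, stage),
                stdout=log,
                stderr=subprocess.STDOUT,
                env=env,
                check=False,  # keep going even if one site fails
            )
        # Exit status is recorded as-is; its meaning is an assumption.
        results[site] = proc.returncode
    return results
```

A call such as `run_batch("new.server.com", ["site1", "site2"], "/usr/local/apnscp/storage/logs/migration")` would then leave one log per site in a directory named after today's date.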

Ran 25 migrations on two new servers and no errors yet, other than a potentially corrupt MySQL database. Will keep trying in small batches and hopefully it’s no longer an issue.

Troy · August 16, 2024, 7:22pm

Welp, re-ran the SELECT r2.name, r2.type, records.content FROM (SELECT name, type FROM records WHERE type = 'A' AND name like '%roundcube%' GROUP BY name,type HAVING COUNT(name) > 1) r2 JOIN records ON(r2.name = records.name AND r2.type = records.type); query and it returns 32 results. These are duplicates pointing at the old server and the new server, and they are part of this 25-account batch migration.

Ironically, these accounts didn’t fail. The only difference this time around is running in debug mode.

msaladna · August 16, 2024, 7:52pm

PM me a log from one of those that succeeded and has duplicate records. What’s the panel version on source and destination as well? Again, not asking these questions to make it more onerous, but rather so I have a clear picture of what’s going on.

Troy · August 17, 2024, 1:39am

Sent you a message on Discord.

Troy · August 17, 2024, 2:46am

Is Stage 0 supposed to set new records with the new server IP? That seems like stage 1 type of stuff.

DELETE FROM records WHERE id IN (SELECT records.id
	FROM (
		SELECT NAME, TYPE FROM records
		WHERE TYPE = 'A'
		-- parentheses keep TYPE = 'A' applied to all three name patterns
		AND (NAME LIKE '%roundcube%'
		OR NAME LIKE '%mail%'
		OR NAME LIKE '%www%')
		GROUP BY NAME,TYPE HAVING COUNT(NAME) > 1
	) r2
JOIN records ON(r2.name = records.name AND r2.type = records.type AND records.content IN ('167.253.62.6','167.253.62.7')));

Is my temp fix / workaround.
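One thing worth keeping in mind with filters that combine a type check and several name patterns: in SQL, AND binds tighter than OR, so the OR terms need parentheses if the type check is meant to apply to all of them. The sketch below shows the difference on a toy table. It uses in-memory SQLite instead of the real backend, and the TXT row is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real records table (assumption).
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, type TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO records (name, type, content) VALUES (?, ?, ?)",
    [
        ("roundcube.domain.com", "A", "167.253.62.6"),
        ("mail.domain.com", "A", "167.253.62.6"),
        ("mail.domain.com", "TXT", "v=spf1 a mx ?all"),  # hypothetical row
    ],
)

# Without parentheses: parsed as (type = 'A' AND roundcube) OR mail.
UNGROUPED = (
    "SELECT name, type FROM records "
    "WHERE type = 'A' AND name LIKE '%roundcube%' OR name LIKE '%mail%' "
    "ORDER BY id"
)

# With parentheses: type = 'A' applies to both name patterns.
GROUPED = (
    "SELECT name, type FROM records "
    "WHERE type = 'A' AND (name LIKE '%roundcube%' OR name LIKE '%mail%') "
    "ORDER BY id"
)

ungrouped = conn.execute(UNGROUPED).fetchall()
grouped = conn.execute(GROUPED).fetchall()

print("ungrouped:", ungrouped)  # includes the TXT row
print("grouped:  ", grouped)    # A records only
```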