Support E2E Test && Jenkins Build Pod Could not connect to Ryuk

We currently use DolphinScheduler 2.0.5 to build our data platform. To improve the platform’s stability and find bugs as early as possible, we want to introduce E2E tests into our project. This post records how we enabled the E2E tests and how we resolved the "Could not connect to Ryuk" problem.

I. Causes

1. E2E Test

The community has a stable E2E testing module, but it does not support version 2.x and can only be used with version 3.0 and above. The community’s approach is to build a new Docker image from the source code and use the Selenium framework to access the Docker container for testing.

However, we have done secondary development on top of the community release and integrated scheduling for jobs such as K8s, Flink, and Spark. We also need to test that the Flink UI is accessible as expected after a Flink job starts. Using Selenium to test the online service therefore suits our situation better, so we plan to adapt the community E2E module and introduce it into our project.

2. Code Changes

The code changes are fairly small. The community’s E2E module already contains the logic for testing a local service, so we only need to change that part so that it tests the online service instead.

DolphinSchedulerExtension.java

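private void runInLocal() {
    // Make this host port reachable from the test containers.
    Testcontainers.exposeHostPorts(443);
    // Host name only: the scheme is supplied later when the URL is built.
    address = HostAndPort.fromParts("your.service.com", 443);
    rootPath = "/";
}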

If your service uses https, change the protocol to https.

driver.get(new URL("http", address.getHost(), address.getPort(), rootPath).toString());
change to
driver.get(new URL("https", address.getHost(), address.getPort(), rootPath).toString());
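
To make the scheme change concrete, here is a minimal, self-contained sketch of how the start URL is assembled from its parts. The class name StartUrl and the useHttps flag are hypothetical and not part of the community E2E module; the sketch only assumes that the host is stored without a scheme, because the scheme is passed to the URL constructor separately.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper (not part of the E2E module) that builds the start URL. */
final class StartUrl {

    /** Chooses the scheme from a flag and combines it with a bare host name. */
    static String of(boolean useHttps, String host, int port, String rootPath) {
        String scheme = useHttps ? "https" : "http";
        try {
            // The scheme is a separate argument, so host must not contain "https://".
            return new URL(scheme, host, port, rootPath).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid URL parts", e);
        }
    }

    public static void main(String[] args) {
        // Prints https://your.service.com:443/
        System.out.println(of(true, "your.service.com", 443, "/"));
    }
}
```

The resulting string is what would be handed to driver.get(...) in the snippet above.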

For the other startup parameter changes and the details of the E2E logic, see the official documentation.

We also needed to change the login logic and the other cases we wanted to test. Selenium locates page elements such as the add and delete buttons by their class names and then drives the browser to operate on them. Version 3.x refactored the front end, so the page elements are completely different from 2.x, and this part of the code has to be adapted. It is not difficult, so I won’t go into details.

3. Local Testing

After the code changes, I tested locally by following the documentation. Because my Mac has an M1 chip, I had to add -Dm1_chip=true to my run parameters, which enables support for ARM64 containers.
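
The flag is an ordinary JVM system property. As a rough illustration only, the sketch below shows how such a property could switch the browser image. The class name BrowserImage and the ARM64 image name are assumptions made for this example, not taken from the E2E module; the default image is the one that appears in the Jenkins logs later in this post.

```java
/** Hypothetical sketch: pick a browser image based on -Dm1_chip=true. */
final class BrowserImage {

    // Image that appears in the Jenkins logs later in this post.
    static final String DEFAULT_IMAGE = "selenium/standalone-chrome-debug:3.141.59";

    // Assumed ARM64-capable image; replace it with whatever your setup uses.
    static final String ARM64_IMAGE = "seleniarm/standalone-chromium:latest";

    /** Returns the ARM64 image only when the m1_chip property is "true". */
    static String select() {
        boolean m1Chip = Boolean.parseBoolean(System.getProperty("m1_chip", "false"));
        return m1Chip ? ARM64_IMAGE : DEFAULT_IMAGE;
    }

    public static void main(String[] args) {
        // Run with: java -Dm1_chip=true BrowserImage
        System.out.println(select());
    }
}
```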

The local test ran successfully: it logged in to our online service as expected and tested the tenant management process.

We have a Jenkins service deployed, and our E2E tests will ultimately run there. Our Jenkins runs in K8s mode: every new task creates a new pod, which is destroyed after the task finishes.

With the local test passing, the next step was to run it in Jenkins. We expected the same result as locally, because our unit tests and online auto build had been running stably in Jenkins for several months, but the E2E module failed.

4. Testing in Jenkins

The E2E module failed when running in Jenkins:

2022-08-06 14:49:54,714 org.testcontainers.DockerClientFactory 190 [main] INFO  [] - Connected to docker: 
Server Version: 19.03.10
API Version: 1.40
Operating System: CentOS Linux 7 (Core)
Total Memory: 386685 MB
2022-08-06 14:49:54,877 docker[testcontainers/ryuk:0.3.3] 376 [main] INFO [] - Creating container for image: testcontainers/ryuk:0.3.3
2022-08-06 14:49:54,884 org.testcontainers.utility.RegistryAuthLocator 164 [main] INFO [] - Failure when attempting to lookup auth config. Please ignore if you don't have images in an authenticated registry. Details: (dockerImageName: testcontainers/ryuk:0.3.3, configFile: /root/.docker/config.json. Falling back to docker-java default behaviour. Exception message: /root/.docker/config.json (No such file or directory)
2022-08-06 14:49:55,088 docker[testcontainers/ryuk:0.3.3] 440 [main] INFO [] - Container testcontainers/ryuk:0.3.3 is starting: 066d1e071dfc6bce3a71c1321a4331d1b388f50ac25d7c9de3f41f31447500e2
2022-08-06 14:49:55,854 docker[testcontainers/ryuk:0.3.3] 520 [main] INFO [] - Container testcontainers/ryuk:0.3.3 started in PT1.13S
2022-08-06 14:50:00,863 org.testcontainers.utility.RyukResourceReaper 120 [testcontainers-ryuk] WARN [] - Can not connect to Ryuk at 172.17.0.1:32775
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_212]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_212]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_212]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_212]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_212]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_212]
at org.testcontainers.utility.RyukResourceReaper.lambda$null$0(RyukResourceReaper.java:92) ~[testcontainers-1.17.3.jar:?]
at org.rnorth.ducttape.ratelimits.RateLimiter.doWhenReady(RateLimiter.java:27) ~[duct-tape-1.0.8.jar:?]
at org.testcontainers.utility.RyukResourceReaper.lambda$maybeStart$1(RyukResourceReaper.java:88) ~[testcontainers-1.17.3.jar:?]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]
......
2022-08-06 14:50:25,873 org.testcontainers.utility.RyukResourceReaper 131 [main] ERROR [] - Timed out waiting for Ryuk container to start. Ryuk's logs:
2022/08/06 06:49:55 Pinging Docker...
2022/08/06 06:49:55 Docker daemon is available!
2022/08/06 06:49:55 Starting on port 8080...
2022/08/06 06:49:55 Started!

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 31.939 s <<< FAILURE! - in org.apache.dolphinscheduler.e2e.cases.TenantE2ETest
[ERROR] org.apache.dolphinscheduler.e2e.cases.TenantE2ETest Time elapsed: 31.937 s <<< ERROR!
java.lang.IllegalStateException: Could not connect to Ryuk at 172.17.0.1:32775

The error message is clear: "Could not connect to Ryuk".

II. Troubleshooting

1. Preliminary check

The first step when encountering a problem is to Google it, and I found that this is a common problem. There are a lot of suggested solutions on the web, such as:

  • Some people say that restarting Docker resolves the problem, but it did not work for me.
  • Upgrade testcontainers-java to version 1.15.1, but I was already using 1.17.3.
  • Run docker system prune. I had checked that Docker had enough space, so I didn’t try this one.
  • Disable Ryuk by setting the environment variable TESTCONTAINERS_RYUK_DISABLED=true. From what I read, Ryuk is responsible for cleaning up containers after the tests finish. I tried this, and sure enough, no error is reported if Ryuk does not start. But it produces another error:
2022/08/06 08:51:49,364 docker[testcontainers/sshd:1.1.0] 440 [main] INFO  [] - Container testcontainers/sshd:1.1.0 is starting: a9a5ccc8addcedb374a290675893291214d2123b6e2a5dc5e30b9e54aa01e828
2022/08/06 08:52:50,405 docker[testcontainers/sshd:1.1.0] 529 [main] ERROR [] - Could not start container
org.testcontainers.containers.ContainerLaunchException: Timed out waiting for container port to open (172.17.0.1 ports: [32812] should be listening)
at org.testcontainers.containers.wait.strategy.HostPortWaitStrategy.waitUntilReady(HostPortWaitStrategy.java:102) ~[testcontainers-1.17.3.jar:?]
......
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 62.677 s <<< FAILURE! - in org.apache.dolphinscheduler.e2e.cases.TenantE2ETest
[ERROR] org.apache.dolphinscheduler.e2e.cases.TenantE2ETest Time elapsed: 62.675 s <<< ERROR!
org.testcontainers.containers.ContainerLaunchException: Container startup failed
Caused by: org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception
Caused by: org.testcontainers.containers.ContainerLaunchException: Could not create/start container
Caused by: org.testcontainers.containers.ContainerLaunchException: Timed out waiting for container port to open (172.17.0.1 ports: [32812] should be listening)

The error is still a failure to connect, this time to 172.17.0.1:32812. It means that whatever the service is, the connection cannot be established. In other words, disabling Ryuk does not help; the underlying connectivity problem must be resolved.

2. Could not connect

So the problem comes back to Ryuk. Checking the log again, we can see that the Ryuk container has actually been started:

2022-08-06 14:49:55,088 docker[testcontainers/ryuk:0.3.3] 440 [main] INFO  [] - Container testcontainers/ryuk:0.3.3 is starting: 066d1e071dfc6bce3a71c1321a4331d1b388f50ac25d7c9de3f41f31447500e2
2022-08-06 14:49:55,854 docker[testcontainers/ryuk:0.3.3] 520 [main] INFO [] - Container testcontainers/ryuk:0.3.3 started in PT1.13S
2022-08-06 14:50:00,863 org.testcontainers.utility.RyukResourceReaper 120 [testcontainers-ryuk] WARN [] - Can not connect to Ryuk at 172.17.0.1:32775

Log in to the Jenkins pod to check whether the Ryuk service is really running.

root@jenkins-pipeline-ln62r:~/agent# docker ps | grep ryuk
f1298662d49f testcontainers/ryuk:0.3.3 "/app" 7 seconds ago Up 6 seconds 0.0.0.0:32772->8080/tcp testcontainers-ryuk-e84b7bc3-00cd-491f-9d47-e2d156e91669

The Ryuk service is running, so the problem is that we cannot connect to it. Let’s test that with the wget command.

root@jenkins-pipeline-ln62r:~/agent# wget 172.17.0.1:32772
--2022-08-06 11:52:48-- http://172.17.0.1:32772/
Connecting to 172.17.0.1:32772...

Indeed, the connection cannot be established from the pod. Because the Jenkins task pod uses the host machine’s /run/docker.sock file, we log in to the host machine and test the service from there.

[root@k8s-worker-01 ~]# curl 172.17.0.1:32772
ACK
ACK
ACK

The connection succeeds from the host machine.
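
The same reachability check can also be scripted, which is handy when wget or curl is not installed in a pod. The following is a minimal sketch under my own assumptions: the class name PortProbe, the default address, and the 3-second timeout are illustrative, and the check only verifies that a TCP connection can be opened, not that Ryuk answers with ACK.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP port accepts connections. */
final class PortProbe {

    /** Returns true if a TCP connection to host:port opens within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers "connection refused" as well as "connect timed out".
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults mirror the address tested above; pass your own host and port.
        String host = args.length > 0 ? args[0] : "172.17.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 32772;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```

Running it once inside the Jenkins task pod and once on the host machine gives the same comparison as the wget and curl commands above.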

At that point I thought it might be a network setting problem in testcontainers. I changed some Docker-related parameters following the documentation, but it still did not work. It looked as if the problem had nothing to do with testcontainers, so I suspected it was still in Docker.

3. Firewall Troubleshooting

Let’s rethink the problem. Every new Jenkins task creates a new Docker container, that container starts a Ryuk service, and the connection to it fails. At the time, I thought this was a Docker in Docker connection problem, and suspected that Docker in Docker could not access the host machine’s ports.

Since the connection could not be established, a firewall issue seemed likely.

After some searching, I suspected that the host machine’s iptables rules were denying connections from Docker. So I used the command below to allow all traffic from Docker to reach all of the host’s ports.

iptables -A INPUT -i docker0 -j ACCEPT

But this did not resolve the problem.

I then came across a suggestion that the host’s Docker configuration might be denying access, so I tried changing it.

ExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:4243 -H unix:///var/run/docker.sock

But this did not solve the problem either.

I still suspected that the host machine’s firewall was blocking some access, so I disabled the firewall on the host machine.

The connection still could not be established.

Around that time I found the same problem reported on GitHub. Following the suggestions there, I used httpd to test.

docker run -d -p 9090:80 httpd
curl localhost:9090

The httpd service could not be accessed either.

That means our reasoning was wrong and the problem has nothing to do with the firewall: all permissions were open and all firewalls were disabled, yet the service was still unreachable.

By now I had also found that testing with Ryuk is inconvenient. I have to run mvn commands to start the E2E test, which creates a Ryuk service, and Ryuk is destroyed as soon as the mvn commands finish, usually after only one or two minutes, so there is little time to test connectivity. Since the httpd service is unreachable in the same way, I can instead start an httpd container from the Jenkins task pod, and it will not be destroyed. Once I can access the httpd service from the pod, the problem is solved.

4. Network Troubleshooting

It looks like a network issue.

Let’s check the network configuration of the Jenkins task pod.

root@jenkins-pipeline-w34kq:~/agent# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1480
inet 172.17.213.171 netmask 255.255.255.255 broadcast 172.17.213.171
ether c2:76:3d:10:73:1e txqueuelen 0 (Ethernet)
RX packets 29561 bytes 176211133 (168.0 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 21634 bytes 5269854 (5.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (Local Loopback)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

This looks fine. The Jenkins task pod uses a bridge network, and the host should have a virtual network interface on the 172.17.0.0/16 segment.

Log in to the host machine to check its network configuration.

[root@k8s-worker-01 ~]# ifconfig
......
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:7d:ac:d8:fb txqueuelen 0 (Ethernet)
RX packets 62624 bytes 3005816052 (2.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 69613 bytes 280415917 (267.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
......

This looks fine too.
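
When reading the two ifconfig outputs, it helps to translate the netmasks into prefix lengths. The sketch below does that arithmetic; the class name Netmask is hypothetical, and the bit count equals the prefix length only for a valid, contiguous netmask, which the code assumes rather than checks.

```java
/** Hypothetical helper: converts a dotted-quad IPv4 netmask to a prefix length. */
final class Netmask {

    /** Packs a dotted-quad IPv4 string into a 32-bit integer. */
    static int toInt(String dottedQuad) {
        String[] parts = dottedQuad.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected four octets: " + dottedQuad);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Counts the set bits, which is the prefix length of a contiguous netmask. */
    static int prefixLength(String netmask) {
        return Integer.bitCount(toInt(netmask));
    }

    public static void main(String[] args) {
        System.out.println("docker0 on the host: /" + prefixLength("255.255.0.0"));     // /16
        System.out.println("eth0 in the pod:     /" + prefixLength("255.255.255.255")); // /32
    }
}
```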

I was stuck at this point; I didn’t know which configuration was wrong and preventing the connection.

Let’s look at it from another angle. I can access the Ryuk service from the host machine, but the Jenkins task pod, which uses the host machine’s docker.sock file, cannot. So could we run a full Docker service inside the Jenkins task pod? With its own docker.sock, I thought the pod would be able to connect to the Ryuk service.

After reading the documentation, I found this more difficult to implement, and I didn’t think it was the best solution because it does not address the root cause. It would only bypass the problem without solving the core issue.

5. Host Mode Start

During further searching, I came across a piece of information by accident. If I start Nginx with host networking from inside a pod, can I access the Nginx service from that pod?

root@jenkins-pipeline-q68td:~/agent# docker run -d --name nginx --network host nginx
025906c124c9067f49c0eb1db7745540af88cea34b0094df0df815439e2d836b
root@jenkins-pipeline-q68td:~/agent# wget 172.17.0.1:80
--2022-08-07 10:52:43-- http://172.17.0.1/
Connecting to 172.17.0.1:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 615 [text/html]
Saving to: 'index.html'

index.html 100%[=====================================================================================================>] 615 --.-KB/s in 0s

2022-08-07 10:52:43 (43.7 MB/s) - 'index.html' saved [615/615]

To my surprise, it was accessible. This means that when a Docker container is started in host network mode, we can access it from the pod.

Does that mean that if I start the testcontainers containers in host network mode, I can access them too, and the problem will be resolved?

The idea was now clear: start the testcontainers containers in host network mode.

I tried changing the Java code to specify the host network in the startup parameters, but it did not work. Another option would be to change Docker’s default startup parameters to host mode, but I could not find a way to do that. After more research, I could not find any other way to configure testcontainers to start in host mode.

Stuck again.

6. Jenkins With Host Mode

Then I changed my approach again. If I run the Jenkins agent pod in host network mode, can it access other containers launched on the bridge network?

Jenkins already includes this feature: enable Host Network in the Pod Templates section of the configureClouds page, and change the Jenkins URL to one that is reachable in host network mode.

After checking, there were no problems.

root@pdak8s-worker-01:~/agent# docker run -d -p 9090:80 httpd
8749834785fca955997b5cad94755f7abe12b00eb6d09cd3d33b36e6b53eb075
root@pdak8s-worker-01:~/agent# curl 172.17.0.1:9090
<html><body><h1>It works!</h1></body></html>

We reran the E2E test to see whether it could access the Ryuk service.

2022-08-06 10:59:48,874 docker[testcontainers/ryuk:0.3.3] 440 [main] INFO  [] - Container testcontainers/ryuk:0.3.3 is starting: 0859ffdf6c6b6c06e0ea39acf6155aaa98bb837f2926640af8acd8d268d0d34b
2022-08-06 10:59:49,734 docker[testcontainers/ryuk:0.3.3] 520 [main] INFO [] - Container testcontainers/ryuk:0.3.3 started in PT1.497S
......
2022-08-06 10:59:49,747 docker[testcontainers/sshd:1.1.0] 376 [main] INFO [] - Creating container for image: testcontainers/sshd:1.1.0
2022-08-06 10:59:49,777 docker[testcontainers/sshd:1.1.0] 440 [main] INFO [] - Container testcontainers/sshd:1.1.0 is starting: fcbd8d04666b5ec6d39aaba8fa6671815bff9572b97428cd0f3581c03aba838c
2022-08-06 10:59:50,660 docker[testcontainers/sshd:1.1.0] 520 [main] INFO [] - Container testcontainers/sshd:1.1.0 started in PT0.913S
......
2022-08-06 10:59:50,898 docker[selenium/standalone-chrome-debug:3.141.59] 376 [main] INFO [] - Creating container for image: selenium/standalone-chrome-debug:3.141.59
2022-08-06 10:59:52,284 docker[selenium/standalone-chrome-debug:3.141.59] 440 [main] INFO [] - Container selenium/standalone-chrome-debug:3.141.59 is starting: 9ab1d8d6f9996e5a94920214d904c51c0b5767b65f862192a407f7e763bb80c8
2022-08-06 10:59:55,899 docker[selenium/standalone-chrome-debug:3.141.59] 520 [main] INFO [] - Container selenium/standalone-chrome-debug:3.141.59 started in PT5.001S
2022-08-06 10:59:55,902 docker[testcontainers/vnc-recorder:1.2.0] 376 [main] INFO [] - Creating container for image: testcontainers/vnc-recorder:1.2.0
2022-08-06 10:59:55,946 docker[testcontainers/vnc-recorder:1.2.0] 440 [main] INFO [] - Container testcontainers/vnc-recorder:1.2.0 is starting: 14e9dbcc3fafd3356113624117e792a18929fbb494c7d41b1bc62089c3a06f5f
2022-08-06 10:59:56,803 docker[testcontainers/vnc-recorder:1.2.0] 520 [main] INFO [] - Container testcontainers/vnc-recorder:1.2.0 started in PT0.902S
Aug 06, 2022 10:59:58 AM org.openqa.selenium.remote.ProtocolHandshake createSession
......

No errors at all, which means our problem is finally solved.

III. Summary

Looking back, our original reasoning was flawed: this is not actually Docker in Docker.

The Jenkins task pod uses the same /var/run/docker.sock file as the host machine. This means the Jenkins task pod uses the same Docker service as the host machine, so the Ryuk service started by the Jenkins task is actually started on the host machine’s Docker. In other words, the Jenkins task pod and Ryuk are two containers at the same level on the host.

I took a lot of detours, but fortunately the problem was finally solved. I have left a comment on the same issue on GitHub so that others don’t have to repeat these detours.

Solving this problem has raised a new one. Our Java services are also deployed in Docker and connect to other services such as Nginx and Redis, which are Docker containers too, and they can all reach each other normally. So why can’t the Jenkins task pod connect to Ryuk, when both containers are started in bridge mode?

To be continued.

