Posts by Demis

1) Message boards : Number crunching : How many cores are required? (Message 721)
Posted 1 hour ago by Demis
Post:
There may be many different reasons.
For example, let's start with the simplest question:

When you started the boinc-client, how many tasks was it set to run simultaneously?
2) Questions and Answers : Server any other problems : Some data has been corrected (Message 719)
Posted 1 day ago by Demis
Post:
Next step.
Duplicates.

My data analysis showed that duplicates fall into two large groups:
1. Duplicates at the physical level.
2. Duplicates at the logical level.

Group 2 always includes all the values from group 1.

Duplicates at the physical level have now been eliminated.
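A minimal sketch of why group 2 always contains group 1, under an assumed definition that the post does not spell out: a physical duplicate is a row repeated with every field identical, and a logical duplicate is any row whose key (here, hypothetically, the tuple's start value and kind) appears more than once. Identical rows necessarily share a key, so every physical duplicate is also a logical one. The row layout and the function names are illustrative only, not the project's schema.

```python
from collections import Counter

# Hypothetical row layout: (start, k, kind, ofs). The real table layout is not shown in the post.

def physical_duplicates(rows):
    """Rows that occur more than once with every field identical."""
    counts = Counter(rows)
    return {row for row, n in counts.items() if n > 1}

def logical_duplicates(rows):
    """Rows whose (start, kind) key is shared by at least one other row.

    ASSUMPTION: (start, kind) is used as the logical key for illustration only.
    """
    keys = Counter((start, kind) for start, _k, kind, _ofs in rows)
    return {row for row in rows if keys[(row[0], row[2])] > 1}

rows = [
    (100, 16, 0, (2, 4)),
    (100, 16, 0, (2, 4)),   # identical copy: physical (and therefore logical) duplicate
    (200, 16, 0, (6, 8)),
    (200, 16, 0, (6, 10)),  # same key, different ofs: logical duplicate only
    (300, 10, 1, (2, 2)),
]
# Every physical duplicate is also a logical duplicate.
assert physical_duplicates(rows) <= logical_duplicates(rows)
```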
3) Questions and Answers : Server any other problems : Some data has been corrected (Message 716)
Posted 8 days ago by Demis
Post:
Two of the 162 tasks have been scheduled today for recalculation by crunchers.
If these two tasks come back without problems, the remaining 160 tasks will also be reassigned for recalculation.

The first of the two has been received.
...
The mechanism for finding and reassigning problematic answers is now clear.
Work in this direction continues...

The remaining 160 tasks have now been reassigned for recalculation.
4) Questions and Answers : Server any other problems : Some data has been corrected (Message 714)
Posted 10 days ago by Demis
Post:
Two of the 162 tasks have been scheduled today for recalculation by crunchers.
If these two tasks come back without problems, the remaining 160 tasks will also be reassigned for recalculation.

The first of the two has been received:
https://boinc.termit.me/adsl/workunit.php?wuid=15413

The check showed that the result is now correct:
https://boinc.termit.me/adsl/spt_explore.php?spt=16&s=4687939591477390991
https://boinc.termit.me/adsl/spt_explore.php?spt=16&s=4687939755461166661
https://boinc.termit.me/adsl/spt_explore.php?spt=16&s=4687939864673207491
https://boinc.termit.me/adsl/spt_explore.php?spt=16&s=4687941031016593387
https://boinc.termit.me/adsl/spt_explore.php?tpt=14&s=4687939512514954517
https://boinc.termit.me/adsl/spt_explore.php?stpt=10&s=4687940599903351247
https://boinc.termit.me/adsl/spt_explore.php?stpt=10&s=4687940605722467279
https://boinc.termit.me/adsl/spt_explore.php?stpt=10&s=4687940688488835077

The mechanism for finding and reassigning problematic answers is now clear.
Work in this direction continues...
5) Questions and Answers : Server any other problems : Some data has been corrected (Message 713)
Posted 16 days ago by Demis
Post:
There are errors in the database caused by hardware faults on crunchers' computers.
There are not many of them: about 2,000 to 3,000
(out of more than 15,000,000 responses from crunchers in total).

As of now there are
888 incorrect answers from crunchers.
All of these incorrect answers come from 162 workunits.
Two of the 162 tasks have been scheduled today for recalculation by crunchers.
If these two tasks come back without problems, the remaining 160 tasks will also be reassigned for recalculation.
6) Questions and Answers : Automatically generated server job : SPT task (Message 712)
Posted 17 days ago by Demis
Post:
Batch 138: 10160928384525935453 begin: 10160928379391935453, end: 10160928389659935453
Count: 1
Overlap: from -5134000000 to +5134000000 relative to the value above. This is a special WU created for overlap_135_137.
7) Questions and Answers : Automatically generated server job : SPT task (Message 711)
Posted 17 days ago by Demis
Post:
Batch 137: 9911328384525935453 .. 10160928384525935453 -1
Count: 128000
Continuing from 9.91E+18
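The batch announcements fit together arithmetically. Assuming "begin .. end -1" means the half-open range [begin, end) split into `Count` equal workunits, batch 137 works out to 1,950,000,000,000 per workunit (the same `step` quoted for a single WU further down this list), and the special overlap WU of batch 138 is simply the end of batch 137 plus or minus 5,134,000,000. A sketch under that assumption; the helper names are hypothetical, not the server's code.

```python
from typing import Tuple

# Figures copied from the batch 137 and batch 138 announcements above.
BATCH_137_BEGIN = 9911328384525935453
BATCH_137_END = 10160928384525935453  # ASSUMPTION: "end -1" means the range stops just before this value
BATCH_137_COUNT = 128000
OVERLAP_HALF_WIDTH = 5134000000

def wu_step(begin: int, end: int, count: int) -> int:
    """Width of one workunit when [begin, end) is split into `count` equal parts."""
    width, remainder = divmod(end - begin, count)
    if remainder:
        raise ValueError("range does not split evenly into workunits")
    return width

def wu_range(begin: int, step: int, index: int) -> Tuple[int, int]:
    """Half-open interval [lo, hi) covered by workunit `index` (0-based)."""
    lo = begin + index * step
    return lo, lo + step

def overlap_window(boundary: int, half_width: int) -> Tuple[int, int]:
    """Interval centred on a batch boundary, like the special overlap WU."""
    return boundary - half_width, boundary + half_width

step = wu_step(BATCH_137_BEGIN, BATCH_137_END, BATCH_137_COUNT)
print(step)  # 1950000000000
print(wu_range(BATCH_137_BEGIN, step, BATCH_137_COUNT - 1)[1] == BATCH_137_END)  # True
print(overlap_window(BATCH_137_END, OVERLAP_HALF_WIDTH))
# (10160928379391935453, 10160928389659935453): the begin/end of batch 138
```

Python integers are arbitrary-precision, so values above 2^63 (as in batch 138) need no special handling here.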
8) Questions and Answers : Server any other problems : Some data has been corrected (Message 709)
Posted 18 days ago by Demis
Post:
There are errors in the database caused by hardware faults on crunchers' computers.
There are not many of them: about 2,000 to 3,000
(out of more than 15,000,000 responses from crunchers in total).

As of now there are
888 incorrect answers from crunchers.
All of these incorrect answers come from 162 workunits.

Total number of workunits issued: 2,691,776
Total number of values received: 16,867,165

Share of workunits with errors: 162 / 2,691,776 × 100% ≈ 0.0060183%

Share of incorrect values: 888 / 16,867,165 × 100% ≈ 0.0052647%

These are just current error statistics.
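The two percentages are just part / whole × 100. A small check that reproduces them; the last line also confirms that "about 3,000 out of 15 million" is the 0.02% figure quoted elsewhere in this thread. The function name is illustrative.

```python
def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent."""
    return part / whole * 100

workunits_issued = 2_691_776
values_received = 16_867_165

print(f"{percent(162, workunits_issued):.7f}%")  # 0.0060183%
print(f"{percent(888, values_received):.7f}%")   # 0.0052647%
print(f"{percent(3000, 15_000_000):.2f}%")       # 0.02%
```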
9) Message boards : Number crunching : Problem with validation. (Message 708)
Posted 21 days ago by Demis
Post:
...
...
That is, the file was deleted by the standard file_deleter utility, even though the transaction had been cancelled and the database should not have changed.
Incredible!!!

The point is that file_deleter runs on its own.
It takes the list of files to delete from the database (the table has special fields for exactly this purpose).

That can mean only one thing: the "flag to delete" was set in the database even though the transaction had been cancelled because of an error in "First".
Nonsense!
How is that even possible?
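For reference, this is the behaviour the post expected: a change made inside a transaction that is rolled back must leave nothing for a separate process such as file_deleter to act on. A minimal sketch using Python's built-in sqlite3 module and a hypothetical one-column table (not BOINC's real schema, and not MySQL); `assimilate` is an illustrative name. It shows the expected semantics only and does not diagnose the problem described here.

```python
import sqlite3

# Hypothetical stand-in table: only a "delete flag" column is modelled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, file_delete_state INTEGER NOT NULL)")
conn.execute("INSERT INTO result (id, file_delete_state) VALUES (1, 0)")
conn.commit()

def assimilate(conn: sqlite3.Connection, result_id: int, fail: bool) -> None:
    """Set the delete flag inside a transaction; optionally fail before commit."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE result SET file_delete_state = 1 WHERE id = ?", (result_id,))
        if fail:
            raise RuntimeError("simulated error during assimilation")

try:
    assimilate(conn, 1, fail=True)
except RuntimeError:
    pass

state = conn.execute("SELECT file_delete_state FROM result WHERE id = 1").fetchone()[0]
print(state)  # 0: the rolled-back UPDATE left no trace
```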

An event like this immediately raises a very wide range of quite different questions.
Is this a bug in our assimilator code?
Is this a bug in the BOINC server code?
Is this a mistake in the settings of the production server?
Is this an error in the transactional model?
Is this a bug in MySQL?
Is this a problem in the MySQL settings?
Is this a problem in the settings of the OS the BOINC server runs on?
Can we be sure this error never occurs when the data is processed and "all the necessary files are there"?
How can we trust the received data?
And so on...

There are more than a thousand questions...
Writing them all down would probably take more than a year.

Therefore, item 3 now has no answer.
And I just don't know what to do with it...

That's all for now.

P.S.
And please remember: I don't have much free time for this project.
So while I am answering your questions, nothing else, including more important work, gets done on the project.
It's like this: either "we sit on the forum and blah blah blah" or "we do something the project needs."
We cannot do both at once.
That's why I don't answer your questions very often.

And finally, I repeat: when I was asked a question on Monday, I answered it briefly: "Yes"
https://boinc.termit.me/adsl/forum_thread.php?id=15&postid=365 .

But, in any case, your comments are welcome!

https://www.jitbit.com/alexblog/203-what-if-drivers-were-hired-like-programmers/

This issue has been clearly identified and resolved.
10) Questions and Answers : Server any other problems : Some data has been corrected (Message 707)
Posted 25 days ago by Demis
Post:
And finally.
You have now written that quorum=2 will solve the problem of errors caused by hardware faults.

Yes.
Quorum 2 eliminates this problem.

What stopped you from introducing this quorum=2 at some point over these several months?

Common sense...
11) Questions and Answers : Server any other problems : Some data has been corrected (Message 706)
Posted 25 days ago by Demis
Post:
Take another look.

What am I supposed to look at again???

https://boinc.termit.me/adsl/forum_thread.php?id=67&postid=696#696

You wrote

"We see:
tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773905876457 k=16 kind=0 (spt) deriv=0 ofs=2 24 18 46 24 18 30 32"

I asked:
Is the last solution (highlighted in red) incorrect?

You answered "Yes".

I am asking you to show the correct solution in a format I can understand, with the pattern and the first element of the tuple (I don't understand your format).
Is that hard to show?

I show it in the format in which the server stores and retrieves it.
That rules out even trivial conversion errors.
It's easier for me that way.
Is that really not clear?

Or is it "no time"?

I also ask you to explain how you found the correct solution to replace the incorrect one shown.
As I wrote above: only by recalculation. That is, the task was recalculated with the spt program that every cruncher has.
And yes, indeed.
It takes two hours
on the computer where I ran it.


Is this the correct result (highlighted in green)?

tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773581271527 k=16 kind=0 (spt) deriv=0 ofs=22 18 30 12 50 18 52 8
Yes


I repeat the question: how did you find the correct solution?
I recalculated it.

Here is the standard notation for the tuple:
5499120773581271527: [0, 22, 40, 70, 82, 132, 150, 202, 210, 262, 280, 330, 342, 372, 390, 412]

This is exactly the notation I asked you to give.
Here you go:
5499120046153320487: [0, 54, 84, 94, 96, 130, 150, 172, 174, 196, 216, 250, 252, 262, 292, 346]
5499120251551369451: [0, 30, 50, 126, 162, 182, 192, 242, 246, 296, 306, 326, 362, 438, 458, 488]
5499120773581271527: [0, 22, 40, 70, 82, 132, 150, 202, 210, 262, 280, 330, 342, 372, 390, 412]
5499121289947186217: [0, 44, 50, 66, 110, 134, 140, 156, 164, 180, 186, 210, 254, 270, 276, 320]
5499121372440344689: [0, 34, 52, 54, 108, 154, 258, 264, 358, 364, 468, 514, 568, 570, 588, 622]
5499120954814009877: [0, 2, 72, 74, 132, 134, 144, 146, 222, 224, 342, 344, 384, 386]
5499121634173665539: [0, 2, 18, 20, 30, 32, 42, 44, 60, 62]
But double-check them yourself, since I did this by hand.
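The hand conversion above can be sketched in a few lines. The expansion rules are inferred only from the seven examples in this thread, so treat them as assumptions: for kind 0 (spt) and kind 1 (stpt) the `ofs` values are the first half of the gaps and are mirrored to give the second half, without repeating the central gap; for kind 2 (tpt) the tuple consists of twin pairs (p, p+2) and each `ofs` value is the distance from the upper member of one pair to the lower member of the next. `expand_ofs` is a hypothetical helper, and the odd-k symmetric case (such as the k=13 line in the bad output) is not covered by any worked example here, so it is unverified.

```python
from typing import List

def expand_ofs(kind: int, k: int, ofs: List[int]) -> List[int]:
    """Turn a server-side `ofs` list into the standard offset pattern [0, ...].

    ASSUMPTION: rules inferred from the worked examples in this thread only.
    kind 0 (spt), kind 1 (stpt): symmetric tuple; `ofs` holds the first half of
        the gaps and the second half mirrors it (the central gap is not repeated
        when k is even; the odd-k case is untested).
    kind 2 (tpt): twin pairs (p, p + 2); each `ofs` value is the gap from the
        upper member of one pair to the lower member of the next pair.
    """
    if kind in (0, 1):
        if len(ofs) != k // 2:
            raise ValueError(f"a symmetric {k}-tuple needs {k // 2} ofs values")
        gaps = ofs + ofs[: k - 1 - len(ofs)][::-1]
    elif kind == 2:
        gaps = [2]
        for gap in ofs:
            gaps += [gap, 2]
    else:
        raise ValueError(f"unknown kind: {kind}")
    if len(gaps) != k - 1:
        raise ValueError(f"{len(ofs)} ofs values do not describe a {k}-tuple")
    pattern = [0]
    for gap in gaps:
        pattern.append(pattern[-1] + gap)  # running sum of gaps = offsets from the start
    return pattern

print(f"5499120773581271527: {expand_ofs(0, 16, [22, 18, 30, 12, 50, 18, 52, 8])}")
# 5499120773581271527: [0, 22, 40, 70, 82, 132, 150, 202, 210, 262, 280, 330, 342, 372, 390, 412]
```

All seven hand-converted lines above are reproduced by these rules, which is the only evidence for them.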


So by what miracle did you guess the correct solution???
Did you recalculate the whole WU?
Yes.


And what is in the whole WU?
I showed it above.

What interval does a cruncher get in one WU?

5499119934525935453..5499121884525935453 (step:1950000000000)
12) Questions and Answers : Server any other problems : Some data has been corrected (Message 702)
Posted 25 days ago by Demis
Post:
Here, for example, is what you wrote (the part that can be made sense of somehow):

"We see:
tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773905876457 k=16 kind=0 (spt) deriv=0 ofs=2 24 18 46 24 18 30 32"
Is the last solution (highlighted in red) incorrect?
Yes.
But to be precise: that one and all the ones after it.

Am I understanding this correctly?
Correct.


Did this solution appear specifically because of a hardware fault?
I assume so.

Or is it a different kind of error?
It doesn't look like it.


The number of entries in the list is also different.

I don't know what "the number of entries in the list" means.
There is a list of lines beginning with "tuple find:"
plus a counter of all these lines, "count:".


Does this number start the incorrect 16-tuple?
5499120773905876457
It is an entirely unexplained number, in the sense that nobody knows where it came from. It should not be there.

Please show it in full, that is, with the pattern, so that I can check this tuple.
That is exactly what my post shows.
Look carefully at k=16, or else I have misunderstood you.

Suppose it is incorrect.

1. If this incorrect tuple appeared because of a hardware fault on a cruncher's computer, can you determine the cause of that fault?

I'm afraid not.
A large number of possible causes have been considered.
But not a single reliable answer has emerged.
The error occurs on different crunchers' computers, belonging to different users, with different hardware.
But very rarely.

No patterns have been found.
Only 0.02% of more than 15 million responses.

I had an algorithm for finding bad answers.
But it is now lost
(the laptop broke back in January).
Once I had found the list of "problematic" solutions, all that remained was to recalculate them locally,
to double-check that the cruncher's data really was incorrect.
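The original search algorithm is lost, so the following is not it. As a swapped-in technique, one cheap partial check needs no recalculation: test whether every member of a reported tuple is actually prime, assuming the "tuple find:" number is the first member and the pattern is in the standard notation used in this thread. A deterministic Miller-Rabin test with the first twelve primes as bases is exact for all numbers below about 3.3e24, which covers these ~5.5e18 values. Two limits: it only catches reported tuples that are false, not tuples missing from a result (like the correct tuple absent from the bad output quoted in this thread), and it does not check that no other primes lie between the members. Finding missing tuples still requires recalculation. `is_prime` and `check_tuple` are hypothetical names.

```python
from typing import List

SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

def is_prime(n: int) -> bool:
    """Deterministic Miller-Rabin; exact for n < 3.3e24 with these 12 bases."""
    if n < 2:
        return False
    for p in SMALL_PRIMES:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in SMALL_PRIMES:
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a is a witness: n is composite
    return True

def check_tuple(start: int, pattern: List[int]) -> List[int]:
    """Members start + offset that are NOT prime; an empty list means the check passed."""
    return [start + offset for offset in pattern if not is_prime(start + offset)]

print(check_tuple(11, [0, 2, 6, 8]))  # []   (11, 13, 17, 19 are all prime)
print(check_tuple(7, [0, 2]))         # [9]
```

For the data in this thread one would pass a start value together with its pattern in standard notation, e.g. `check_tuple(5499120773581271527, [0, 22, 40, ...])`.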


2. Can you say which tuple should be written to the database in place of this incorrect one?
I gave you the correct one right away so that it could be seen and compared, that is, "what was received" from the cruncher and "what it should be". Take another look.


After all, to fix this error you need to know the correct result.
Isn't that so?
Only by recalculating. (Whether locally or via crunchers is a detail.)


Please answer in Russian.
Too many questions!
Exactly! But it didn't interest you before, did it? The topic was raised in my letter of August 12.

I doubt we'll sort it out even in Russian.
Please answer all my questions in order and in detail.
Only when I have time.


If you consider this "blah blah blah again", go on turning out defective data.
That does not depend on me.

After all, as I understand it, nobody has done away with hardware faults on crunchers' computers.
And those computers keep sending incorrect solutions to the database.

Yes.
I have written repeatedly that there are more important tasks.
And this is one of them.

I also wrote that the work is continuing.
There are various ideas about what to do about it.
But they have not yet been implemented in code.
13) Questions and Answers : Server any other problems : Some data has been corrected (Message 696)
Posted 26 days ago by Demis
Post:
Example BAD data:
Read data from file 'wu_431428_803879_spt_101_5499119934525935453_1_366_output.dat' :
ident:5499119934525935453
start:5499119934525935453
chkpt:5499121884525935473
last:5499121884525936859
step (last-start):1950000001406
step (chkpt-start):1950000000020
nprime: 2338848025
status: 1
status2: 2
sieve_init_cs: 208
twin_gap_d: 886
twin_gap_6d: 400
data:
tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773905876457 k=16 kind=0 (spt) deriv=0 ofs=2 24 18 46 24 18 30 32
tuple find: 5499120970980195589 k=16 kind=0 (spt) deriv=0 ofs=10 20 4 18 12 48 2 34
tuple find: 5499121126535252239 k=13 kind=0 (spt) deriv=0 ofs=24 6 60 42 18 12
tuple find: 5499121316841673511 k=16 kind=0 (spt) deriv=0 ofs=2 24 24 28 24 56 4 14
tuple find: 5499121483035391399 k=16 kind=0 (spt) deriv=0 ofs=40 8 22 12 8 22 48 32
tuple find: 5499121558853990699 k=16 kind=0 (spt) deriv=0 ofs=14 16 2 42 6 10 54 86
tuple find: 5499121733722103317 k=16 kind=0 (spt) deriv=0 ofs=42 10 44 34 2 58 42 2
tuple find: 5499121775895727829 k=16 kind=0 (spt) deriv=0 ofs=12 56 46 116 10 8 34 8
tuple find: 5499121234826549117 k=10 kind=1 (stpt) deriv=0 ofs=2 10 2 58 2
tuple find: 5499121440577613711 k=10 kind=1 (stpt) deriv=0 ofs=2 34 2 40 2
tuple find: 5499121475147240399 k=10 kind=1 (stpt) deriv=0 ofs=2 16 2 10 2
tuple find: 5499121591027137257 k=10 kind=1 (stpt) deriv=0 ofs=2 28 2 28 2
tuple find: 5499121666242136481 k=10 kind=1 (stpt) deriv=0 ofs=2 4 2 40 2
tuple find: 5499121680143694047 k=10 kind=1 (stpt) deriv=0 ofs=2 28 2 10 2
tuple find: 5499121740427855217 k=10 kind=1 (stpt) deriv=0 ofs=2 10 2 28 2
tuple find: 5499121817057916077 k=10 kind=1 (stpt) deriv=0 ofs=2 28 2 28 2
end data.
primes.empty() = 0
count: 18
Done.

All binary data fields are correct.
Nothing has been corrupted.

But the correct data for this task is:
Read data from file 'output_101_5499119934525935453-manual.dat' :
ident:5499119934525935453
start:5499119934525935453
chkpt:5499121884525935473
last:5499121884525936859
step (last-start):1950000001406
step (chkpt-start):1950000000020
nprime: 2240302156
status: 1
status2: 2
sieve_init_ms: 4080 (4 sec)
twin_gap_d: 886
twin_gap_6d: 400
data:
tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773581271527 k=16 kind=0 (spt) deriv=0 ofs=22 18 30 12 50 18 52 8
tuple find: 5499121289947186217 k=16 kind=0 (spt) deriv=0 ofs=44 6 16 44 24 6 16 8
tuple find: 5499121372440344689 k=16 kind=0 (spt) deriv=0 ofs=34 18 2 54 46 104 6 94
tuple find: 5499120954814009877 k=14 kind=2 (tpt) deriv=0 ofs=70 58 10 76 118 40
tuple find: 5499121634173665539 k=10 kind=1 (stpt) deriv=0 ofs=2 16 2 10 2
end data.
primes.empty() = 0
count: 7
Done.

We see:
tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773905876457 k=16 kind=0 (spt) deriv=0 ofs=2 24 18 46 24 18 30 32
...
...
and
tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773581271527 k=16 kind=0 (spt) deriv=0 ofs=22 18 30 12 50 18 52 8
...
...
After line 2 the data is incorrect.
The number of entries in the list is also different.
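Comparing the two listings by eye works for 18 lines, but the same check is easy to script. A sketch, assuming the plain-text output layout shown above (the helper names are hypothetical): it parses every "tuple find:" line, cross-checks the "count:" line when one is present, and reports the 0-based index of the first entry where two outputs differ. On the three-line excerpts below it reports index 2, i.e. the data diverges after line 2, as stated above.

```python
import re
from typing import List, Optional, Tuple

# One parsed line: (start, k, kind, ofs), following the sample output above.
Entry = Tuple[int, int, int, Tuple[int, ...]]

TUPLE_LINE = re.compile(
    r"^tuple find: (\d+) k=(\d+) kind=(\d+) \((\w+)\) deriv=(\d+) ofs=([\d ]+)$"
)

def parse_output(text: str) -> List[Entry]:
    """Collect every 'tuple find:' line; if a 'count:' line is present, check it."""
    entries: List[Entry] = []
    declared = None
    for raw in text.splitlines():
        line = raw.strip()
        match = TUPLE_LINE.match(line)
        if match:
            start, k, kind, _name, _deriv, ofs = match.groups()
            entries.append((int(start), int(k), int(kind), tuple(int(x) for x in ofs.split())))
        elif line.startswith("count:"):
            declared = int(line.split(":", 1)[1])
    if declared is not None and declared != len(entries):
        raise ValueError(f"count: {declared}, but {len(entries)} 'tuple find:' lines")
    return entries

def first_divergence(a: List[Entry], b: List[Entry]) -> Optional[int]:
    """0-based index of the first differing entry, or None if the lists are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

BAD = """\
tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773905876457 k=16 kind=0 (spt) deriv=0 ofs=2 24 18 46 24 18 30 32
"""
GOOD = """\
tuple find: 5499120046153320487 k=16 kind=0 (spt) deriv=0 ofs=54 30 10 2 34 20 22 2
tuple find: 5499120251551369451 k=16 kind=0 (spt) deriv=0 ofs=30 20 76 36 20 10 50 4
tuple find: 5499120773581271527 k=16 kind=0 (spt) deriv=0 ofs=22 18 30 12 50 18 52 8
"""
print(first_divergence(parse_output(BAD), parse_output(GOOD)))  # 2
```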
14) Questions and Answers : Server any other problems : Some data has been corrected (Message 694)
Posted 26 days ago by Demis
Post:
Yes.
Quorum 2 eliminates this problem.
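Why quorum 2 helps: with BOINC's `min_quorum` set to 2, a result is accepted only when two independently computed replicas agree, and a rare hardware fault is very unlikely to produce the same wrong answer on two different machines. The sketch below shows only the voting idea, comparing results as sets of "tuple find:" lines; it is not BOINC's validator, and `validate_quorum` is a hypothetical name.

```python
from collections import Counter
from typing import FrozenSet, List, Optional

# Hypothetical canonical form of one replica's result: its set of "tuple find:" lines.
ResultSet = FrozenSet[str]

def validate_quorum(replicas: List[ResultSet], min_quorum: int = 2) -> Optional[ResultSet]:
    """Return the result that at least `min_quorum` replicas agree on, else None."""
    if not replicas:
        return None
    candidate, votes = Counter(replicas).most_common(1)[0]
    return candidate if votes >= min_quorum else None

good = frozenset({"tuple find: A", "tuple find: B"})
bad = frozenset({"tuple find: A", "tuple find: C"})  # e.g. corrupted by a hardware fault

print(validate_quorum([good, bad]))                # None: replicas disagree, another is needed
print(validate_quorum([good, bad, good]) == good)  # True
```

The cost is that every workunit is computed at least twice.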
15) Questions and Answers : Server any other problems : Some data has been corrected (Message 692)
Posted 26 days ago by Demis
Post:
There was an error in the assimilator source code published by Tomas.
Because of this error, some correct “ofs” results calculated and sent by crunchers were shown with incorrect start values in the database and, accordingly, on the website.
There were about 50,000 of them.

There are also errors in the database caused by hardware faults on crunchers' computers.
There are not many of them: about 2,000 to 3,000
(out of more than 15,000,000 responses from crunchers in total).

Such results will be identified, invalidated, and reissued for recalculation.
Work on this continues...

It is important to understand that hardware errors on crunchers' computers cannot be predicted.
They can only be detected after a result has been received.
16) Questions and Answers : Server any other problems : Some data has been corrected (Message 688)
Posted 27 days ago by Demis
Post:
Some tasks computed by crunchers ended up in an error state,
and no credit was granted for them.

However, it is absolutely certain that these tasks were computed correctly,
and their results are present in the database.

For all such tasks, the statistics were recalculated and credit was granted.
The following text has been added to the captions of such tasks:
Validation rechecked, correct credit, calculated, fixed and assigned. v.1.0
(Example: https://boinc.termit.me/adsl/result.php?resultid=2688050)

"Batch" - "Count err":
103 - 1
111 - 19
113 - 6
115 - 99
117 - 14
119 - 211
121 - 42
123 - 18
125 - 161
129 - 114
131 - 54
133 - 16

The cause of these errors was resolved last week.
17) Questions and Answers : Server any other problems : Some data has been corrected (Message 672)
Posted 15 Apr 2024 by Demis
Post:
Some data has been corrected today:
tpt k=24: 0 tuples corrected
tpt k=22: 12 tuples corrected
tpt k=20: 123 tuples corrected
tpt k=18: 2604 tuples corrected
tpt k=16: 47799 tuples corrected
tpt total: 50538 tuples corrected

The cause of the errors was in the assimilator.
This has now been resolved.

Data analysis work will continue.
18) Questions and Answers : Automatically generated server job : SPT task (Message 671)
Posted 13 Apr 2024 by Demis
Post:
Batch 136: 9911328384525935453 begin: 9911328379391935453, end: 9911328389659935453
Count: 1
Overlap: from -5134000000 to +5134000000 relative to the value above. This is a special WU created for overlap_133_135.
19) Questions and Answers : Automatically generated server job : SPT task (Message 670)
Posted 13 Apr 2024 by Demis
Post:
Batch 135: 9661728384525935453 .. 9911328384525935453 -1
Count: 128000
Continuing from 9.66E+18
20) Message boards : Number crunching : Unresponsive computer (Message 666)
Posted 3 Apr 2024 by Demis
Post:
Ok.
Fine!

All that remains is to find the right balance of resources between projects.
This can be done with points 1 and 2 of the FAQ.

Point 1 applies to a specific project (the user sets it in the web form of each project),
while point 2 applies in the boinc-client on the user's own computer.



©2024 Natalia Makarova & Alex Belyshev & Tomáš Brada